The website Kaggle.com, an online community of data scientists, offers many clean, formatted data sets on which analysis can be performed. For this project, I used the Kaggle data set “House Sales in King County, USA,” which includes all home sales from May 2014 to May 2015. The city of Seattle, Washington, USA lies within King County. Seattle is known for having some of the most expensive and lavish homes in the United States. The data set has a wide variety of homes, ranging from small homes to massive mansions containing over 30 rooms.
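library(dplyr)   ## %>%, select(), case_when()
library(ggplot2) ## ggplot(), geom_boxplot(), geom_point()
library(GGally)  ## ggpairs()
library(ggpubr)  ## ggarrange()
library(car)     ## vif()
house <- read.csv('kc_house_data.csv')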
head(house)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000 221900 3 1.00 1180 5650
## 2 6414100192 20141209T000000 538000 3 2.25 2570 7242
## 3 5631500400 20150225T000000 180000 2 1.00 770 10000
## 4 2487200875 20141209T000000 604000 4 3.00 1960 5000
## 5 1954400510 20150218T000000 510000 3 2.00 1680 8080
## 6 7237550310 20140512T000000 1225000 4 4.50 5420 101930
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1 0 0 3 7 1180 0 1955
## 2 2 0 0 3 7 2170 400 1951
## 3 1 0 0 3 6 770 0 1933
## 4 1 0 0 5 7 1050 910 1965
## 5 1 0 0 3 8 1680 0 1987
## 6 1 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
#p1 <- get_googlemap("king county") %>% ggmap
#p1 + geom_point(data = house, aes(x = long, y = lat), alpha = 0.03, colour = "red")
#ggsave("map.png")
knitr::include_graphics("map.png")
Below are all of the columns in this dataset:
Because id, date, lat, long, and zipcode are not needed for this analysis, let’s drop them.
house <- house %>% dplyr::select(-id, -date, -lat, -long, -zipcode)
head(house)
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1 221900 3 1.00 1180 5650 1 0 0
## 2 538000 3 2.25 2570 7242 2 0 0
## 3 180000 2 1.00 770 10000 1 0 0
## 4 604000 4 3.00 1960 5000 1 0 0
## 5 510000 3 2.00 1680 8080 1 0 0
## 6 1225000 4 4.50 5420 101930 1 0 0
## condition grade sqft_above sqft_basement yr_built yr_renovated sqft_living15
## 1 3 7 1180 0 1955 0 1340
## 2 3 7 2170 400 1951 1991 1690
## 3 3 6 770 0 1933 0 2720
## 4 5 7 1050 910 1965 0 1360
## 5 3 8 1680 0 1987 0 1800
## 6 3 11 3890 1530 2001 0 4760
## sqft_lot15
## 1 5650
## 2 7639
## 3 8062
## 4 5000
## 5 7503
## 6 101930
In the dataset, there are a total of 21,613 observations and 16 columns, with no missing values.
nrow(house)
## [1] 21613
ncol(house)
## [1] 16
anyNA(house) ## TRUE if any cell is missing; is.null() only tests whether the object itself is NULL
## [1] FALSE
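Let’s reorder the columns to make the dataset more readable. Note that sqft_above and sqft_basement are not listed in col_order, so this step also drops them, leaving 14 columns.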
col_order <- c("price", "bedrooms", "bathrooms", "floors", "waterfront", "view", "condition", "grade", "yr_built",
"yr_renovated", "sqft_living", "sqft_lot", "sqft_living15", "sqft_lot15")
house <- house[, col_order]
In our dataset, the bathrooms column has some fractional values, so let’s round it to the nearest whole number.
Because yr_built and yr_renovated are numeric columns giving the year a house was built or renovated, let’s convert them into categorical codes. For yr_built, we split the years into six bins (coded 0 to 5): five 20-year intervals from 1900 to 1999, plus a final bin for 2000 and later. For yr_renovated, we create a new column, renovated, set to 1 if the house was renovated and 0 otherwise.
house$bathrooms <- round(house$bathrooms)
house$yr_built <- case_when(
(1900 <= house$yr_built) & (house$yr_built< 1920) ~ 0,
(1920 <= house$yr_built) & (house$yr_built< 1940) ~ 1,
(1940 <= house$yr_built) & (house$yr_built< 1960) ~ 2,
(1960 <= house$yr_built) & (house$yr_built< 1980) ~ 3,
(1980 <= house$yr_built) & (house$yr_built< 2000) ~ 4,
(2000 <= house$yr_built) ~ 5)
house$renovated <- ifelse(house$yr_renovated != 0, 1, 0)
house <- house %>% dplyr::select(-yr_renovated)
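Two details of the recoding step are easy to miss. The following is a minimal sketch on made-up values (the vectors `baths` and `years` are hypothetical, not taken from the data set). It shows that base R's `round()` goes to the nearest integer and sends exact halves to the even neighbour, so it does not always round up, and that `cut()` is one possible alternative to the `case_when()` ladder for the 20-year bins.

```r
# Hypothetical bathroom counts, chosen to include exact half-way cases
baths <- c(1.00, 2.25, 2.50, 3.50, 4.50)

round(baths)    # nearest integer; exact halves go to the even neighbour
ceiling(baths)  # always rounds up

# Hypothetical build years, binned into the same six groups (0 to 5)
years  <- c(1900, 1919, 1920, 1955, 1999, 2000, 2015)
breaks <- c(1900, 1920, 1940, 1960, 1980, 2000, Inf)

# right = FALSE makes each interval closed on the left: [1900, 1920), ...
# labels = FALSE returns integer codes 1..6, so subtract 1 to get 0..5
yr_bin <- cut(years, breaks = breaks, right = FALSE, labels = FALSE) - 1
yr_bin
```

If rounding up is really what is wanted for bathrooms, `ceiling()` would be the function to use; the results in this report are based on `round()`.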
Before diving into EDA, let’s split the dataset into train data and test data.
set.seed(1) ##for reproducibility to get the same split
sample<-sample.int(nrow(house), floor(.80*nrow(house)), replace = F)
train<-house[sample, ] ##training data frame
test<-house[-sample, ] ##test data frame
head(train)
## price bedrooms bathrooms floors waterfront view condition grade yr_built
## 17401 550000 3 2 1.5 0 0 3 8 3
## 4775 275000 4 2 2.0 0 0 3 7 4
## 13218 455000 5 2 2.0 0 0 3 6 4
## 10539 384950 3 2 2.0 0 0 3 7 5
## 8462 140000 2 1 1.0 0 0 2 6 2
## 4050 925000 3 2 2.0 0 0 5 7 2
## sqft_living sqft_lot sqft_living15 sqft_lot15 renovated
## 17401 2910 35200 2590 37500 0
## 4775 2120 6754 2120 6937 0
## 13218 1510 3000 1610 3600 0
## 10539 1860 3690 1870 4394 0
## 8462 900 6400 1350 6405 0
## 4050 2690 7000 1800 6435 0
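With 21,613 rows, `floor(.80 * nrow(house))` gives 17,290 training rows and leaves 4,323 test rows. A quick sanity check that the two pieces really partition the data can be sketched as follows; the data frame `toy` and its columns are hypothetical stand-ins for `house`.

```r
# Toy data frame standing in for `house` (hypothetical values)
toy <- data.frame(id = 1:10, price = seq(100, 1000, by = 100))

set.seed(1)
idx <- sample.int(nrow(toy), floor(0.80 * nrow(toy)), replace = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# The two pieces should cover every row exactly once
n_train   <- nrow(toy_train)
n_test    <- nrow(toy_test)
n_overlap <- length(intersect(toy_train$id, toy_test$id))

c(train = n_train, test = n_test, overlap = n_overlap)
```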
Using ggpairs, we can check the overall relationship between each column and the response variable, price, as well as the distribution of each column.
First, among the physical attributes of the houses (bedrooms, bathrooms, and floors), bathrooms had a fairly good correlation with price. The correlation was somewhat weaker for bedrooms and floors.
house_1 <- train %>% dplyr::select(price, bedrooms, bathrooms, floors)
ggpairs(house_1)
Looking at the boxplots for each column and category, price was not necessarily proportional to the number of bedrooms or floors. In short, the priciest house did not have the largest number of bedrooms or floors. Price did, however, tend to increase with the number of bathrooms, and in our dataset the priciest house also had the largest number of bathrooms. This helps explain why, among these three columns, bathrooms had the highest correlation with price.
p1 <- ggplot(train, aes(x = as.factor(bedrooms), y = price, fill = as.factor(bedrooms))) +
geom_boxplot() +
labs(x = "Number of Bedrooms", y = "Price", title = "Price by Number of Bedrooms", fill = "Bedrooms")
p2 <- ggplot(train, aes(x = as.factor(bathrooms), y = price, fill = as.factor(bathrooms))) +
geom_boxplot() +
labs(x = "Number of Bathrooms", y = "Price", title = "Price by Number of Bathrooms", fill = "Bathrooms")
p3 <- ggplot(train, aes(x = as.factor(floors), y = price, fill = as.factor(floors))) +
geom_boxplot() +
labs(x = "Number of Floors", y = "Price", title = "Price by Number of Floors", fill = "Floors")
ggarrange(p1, p2, p3,
ncol = 1, nrow = 3)
Among the next three columns (view, condition, waterfront), view had a modest correlation with price (0.395) and waterfront a weaker one (0.273). Notably, condition and price had nearly zero correlation (0.015). We should be careful when interpreting these figures, however: a near-zero correlation does not necessarily mean two variables are unrelated, and a high correlation does not by itself imply causation in either direction.
house_2 <- train %>% dplyr::select(waterfront, view, condition, price)
ggpairs(house_2)
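The caveat about near-zero correlation can be made concrete with a small sketch. The vectors `x` and `y` below are hypothetical: `y` is completely determined by `x`, yet the Pearson correlation is essentially zero because the relationship is not linear.

```r
# Hypothetical data: y depends on x exactly, but not linearly
x <- -10:10
y <- x^2

r_raw <- cor(x, y)       # essentially zero: the two halves cancel out
r_abs <- cor(abs(x), y)  # strongly positive once the symmetry is removed

c(raw = r_raw, abs = r_abs)
```

Pearson correlation only measures linear association, which is why the boxplots are worth checking alongside the ggpairs figures.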
According to the boxplots, waterfront houses and houses with a good view tend to be pricier. The condition of a house, however, was not a crucial factor.
p4 <- ggplot(train, aes(x = as.factor(waterfront), y = price, fill = as.factor(waterfront))) +
geom_boxplot() +
labs(x = "Waterfront", y = "Price", title = "Price by Waterfront", fill = "Waterfront")
p5 <- ggplot(train, aes(x = as.factor(view), y = price, fill = as.factor(view))) +
geom_boxplot() +
labs(x = "View", y = "Price", title = "Price by View", fill = "View")
p6 <- ggplot(train, aes(x = as.factor(condition), y = price, fill = as.factor(condition))) +
geom_boxplot() +
labs(x = "Condition", y = "Price", title = "Price by Condition", fill = "Condition")
ggarrange(p4, p5, p6, ncol = 1, nrow = 3)
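Among these three columns, grade had a notably high correlation with price. By contrast, both yr_built and renovated were only weakly correlated with price: the sums of squares in the first step of the stepwise output later in this report imply correlations of roughly 0.05 for yr_built and 0.12 for renovated in magnitude, against roughly 0.67 for grade.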
house_3 <- train %>% dplyr::select(grade, yr_built, renovated, price)
ggpairs(house_3)
According to the boxplots, price increases gradually across the grade categories. More recently built houses also tend to have slightly higher prices. However, there was no big difference in price between renovated and unrenovated houses.
p7 <- ggplot(train, aes(x = as.factor(grade), y = price, fill = as.factor(grade))) +
geom_boxplot() +
labs(x = "Grade", y = "Price", title = "Price by Grade", fill = "Grade")
p8 <- ggplot(train, aes(x = as.factor(yr_built), y = price, fill = as.factor(yr_built))) +
geom_boxplot() +
labs(x = "Year built", y = "Price", title = "Price by Year built", fill = "Year built")
p9 <- ggplot(train, aes(x = as.factor(renovated), y = price, fill = as.factor(renovated))) +
geom_boxplot() +
labs(x = "Renovated", y = "Price", title = "Price by Renovated", fill = "Renovated")
ggarrange(p7, p8, p9, ncol = 1, nrow = 3)
Looking at the sqft_lot and sqft_living columns, sqft_living had a fairly high correlation with price: houses with more square feet of living space generally had higher prices. However, sqft_lot was not highly correlated with price.
house_4 <- train %>% dplyr::select(sqft_living, sqft_lot, price)
ggpairs(house_4)
Accordingly, the fitted line for sqft_living was fairly steep, while that for sqft_lot was much flatter.
p10 <- ggplot(train, aes(x = sqft_living, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Sqft Living", y = "Price", title = "A Scatterplot of Sqft Living vs Price")
p11 <- ggplot(train, aes(x = sqft_lot, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Sqft Lot", y = "Price", title = "A Scatterplot of Sqft Lot vs Price")
ggarrange(p10, p11, ncol = 1, nrow = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
In line with the sqft columns above, sqft_living15 also had a fairly notable correlation with price, while sqft_lot15 did not.
house_5 <- train %>% dplyr::select(sqft_living15, sqft_lot15, price)
ggpairs(house_5)
Likewise, the fitted line for sqft_living15 had a steeper slope than that for sqft_lot15.
p12 <- ggplot(train, aes(x = sqft_living15, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Sqft Living15", y = "Price", title = "A Scatterplot of Sqft Living 15 vs Price")
p13 <- ggplot(train, aes(x = sqft_lot15, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Sqft Lot15", y = "Price", title = "A Scatterplot of Sqft Lot 15 vs Price")
ggarrange(p12, p13, ncol = 1, nrow = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
In addition to each column's correlation with price, a heatmap lets us check the correlation among all pairs of columns. The more red a square is, the more strongly the two variables are correlated. In the bottom right, the predictors bedrooms, bathrooms, sqft_living, grade, and sqft_living15 appear to be fairly strongly correlated with each other. Note that this heatmap is computed on the full house data rather than on train.
mydata.cor <- cor(house)
palette <- colorRampPalette(c("green", "white", "red"))(20)
heatmap(x = mydata.cor, col = palette, symm = TRUE, main = "A Heatmap of All Columns")
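A heatmap is easy to scan but hard to read exact values from. One possible complement, sketched here on a hypothetical toy data frame rather than on `house`, is to list each pair of columns once and sort by absolute correlation.

```r
# Toy data (hypothetical): x2 is an exact linear function of x1, x3 is unrelated
toy <- data.frame(x1 = 1:8,
                  x2 = 2 * (1:8) + 1,
                  x3 = c(3, 1, 4, 1, 5, 9, 2, 6))

cm <- cor(toy)

# Keep each pair once by blanking the diagonal and the lower triangle
cm[lower.tri(cm, diag = TRUE)] <- NA
pairs_df <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(pairs_df) <- c("var1", "var2", "r")
pairs_df <- pairs_df[!is.na(pairs_df$r), ]

# Strongest absolute correlations first
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
pairs_df
```

Applied to `mydata.cor`, the same steps would give a ranked list of the predictor pairs that the red squares point to.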
Because our dataset has a large number of predictors, we can first screen them with automated search procedures, which automate the process of separating useful columns from useless ones. Let’s run stepwise regression, forward selection, and backward elimination to choose predictors.
< Stepwise Regression >
regnull <- lm(price ~ 1, data = train)
regfull <- lm(price ~ ., data = train)
step(regnull, scope = list(lower = regnull, upper = regfull), direction = "both")
## Start: AIC=442775.7
## price ~ 1
##
## Df Sum of Sq RSS AIC
## + sqft_living 1 1.1329e+15 1.1553e+15 430962
## + grade 1 1.0242e+15 1.2640e+15 432516
## + sqft_living15 1 7.8905e+14 1.4992e+15 435466
## + bathrooms 1 6.0414e+14 1.6841e+15 437477
## + view 1 3.5639e+14 1.9318e+15 439850
## + bedrooms 1 2.0920e+14 2.0790e+15 441120
## + waterfront 1 1.7097e+14 2.1172e+15 441435
## + floors 1 1.6137e+14 2.1268e+15 441513
## + renovated 1 3.5656e+13 2.2525e+15 442506
## + sqft_lot 1 2.0037e+13 2.2682e+15 442626
## + sqft_lot15 1 1.5906e+13 2.2723e+15 442657
## + yr_built 1 5.6435e+12 2.2826e+15 442735
## + condition 1 3.4753e+12 2.2847e+15 442751
## <none> 2.2882e+15 442776
##
## Step: AIC=430962
## price ~ sqft_living
##
## Df Sum of Sq RSS AIC
## + view 1 9.6282e+13 1.0590e+15 429459
## + grade 1 9.6101e+13 1.0592e+15 429462
## + waterfront 1 9.0018e+13 1.0653e+15 429561
## + yr_built 1 6.8189e+13 1.0871e+15 429912
## + bedrooms 1 3.3062e+13 1.1223e+15 430462
## + renovated 1 1.6775e+13 1.1386e+15 430711
## + sqft_living15 1 1.6529e+13 1.1388e+15 430715
## + condition 1 1.3494e+13 1.1418e+15 430761
## + sqft_lot15 1 6.0106e+12 1.1493e+15 430874
## + sqft_lot 1 3.2768e+12 1.1520e+15 430915
## + bathrooms 1 2.3654e+12 1.1530e+15 430929
## + floors 1 3.1999e+11 1.1550e+15 430959
## <none> 1.1553e+15 430962
## - sqft_living 1 1.1329e+15 2.2882e+15 442776
##
## Step: AIC=429459.5
## price ~ sqft_living + view
##
## Df Sum of Sq RSS AIC
## + grade 1 8.4847e+13 9.7420e+14 428018
## + yr_built 1 4.6244e+13 1.0128e+15 428690
## + waterfront 1 3.8196e+13 1.0208e+15 428826
## + bedrooms 1 2.1640e+13 1.0374e+15 429105
## + renovated 1 1.0244e+13 1.0488e+15 429293
## + condition 1 9.3046e+12 1.0497e+15 429309
## + sqft_living15 1 9.2323e+12 1.0498e+15 429310
## + sqft_lot15 1 6.9146e+12 1.0521e+15 429348
## + sqft_lot 1 3.9236e+12 1.0551e+15 429397
## + bathrooms 1 2.3140e+12 1.0567e+15 429424
## + floors 1 1.7448e+12 1.0573e+15 429433
## <none> 1.0590e+15 429459
## - view 1 9.6282e+13 1.1553e+15 430962
## - sqft_living 1 8.7277e+14 1.9318e+15 439850
##
## Step: AIC=428017.6
## price ~ sqft_living + view + grade
##
## Df Sum of Sq RSS AIC
## + yr_built 1 1.1066e+14 8.6354e+14 425935
## + waterfront 1 4.0072e+13 9.3412e+14 427293
## + condition 1 2.0864e+13 9.5333e+14 427645
## + renovated 1 1.3455e+13 9.6074e+14 427779
## + bedrooms 1 1.1057e+13 9.6314e+14 427822
## + sqft_lot15 1 5.0887e+12 9.6911e+14 427929
## + floors 1 2.7444e+12 9.7145e+14 427971
## + sqft_lot 1 2.6566e+12 9.7154e+14 427972
## <none> 9.7420e+14 428018
## + bathrooms 1 1.1035e+11 9.7408e+14 428018
## + sqft_living15 1 1.4925e+09 9.7419e+14 428020
## - grade 1 8.4847e+13 1.0590e+15 429459
## - view 1 8.5029e+13 1.0592e+15 429462
## - sqft_living 1 1.6553e+14 1.1397e+15 430729
##
## Step: AIC=425934.8
## price ~ sqft_living + view + grade + yr_built
##
## Df Sum of Sq RSS AIC
## + waterfront 1 4.0818e+13 8.2272e+14 425100
## + bedrooms 1 1.1869e+13 8.5167e+14 425698
## + bathrooms 1 6.1747e+12 8.5736e+14 425813
## + floors 1 4.8417e+12 8.5869e+14 425840
## + sqft_lot15 1 3.7083e+12 8.5983e+14 425862
## + sqft_lot 1 2.1378e+12 8.6140e+14 425894
## + renovated 1 1.5630e+12 8.6197e+14 425906
## + condition 1 1.5122e+12 8.6202e+14 425907
## + sqft_living15 1 3.1983e+11 8.6322e+14 425930
## <none> 8.6354e+14 425935
## - view 1 5.0196e+13 9.1373e+14 426910
## - yr_built 1 1.1066e+14 9.7420e+14 428018
## - grade 1 1.4926e+14 1.0128e+15 428690
## - sqft_living 1 1.6156e+14 1.0251e+15 428898
##
## Step: AIC=425099.6
## price ~ sqft_living + view + grade + yr_built + waterfront
##
## Df Sum of Sq RSS AIC
## + bedrooms 1 9.8179e+12 8.1290e+14 424894
## + bathrooms 1 6.7148e+12 8.1600e+14 424960
## + floors 1 4.1674e+12 8.1855e+14 425014
## + sqft_lot15 1 3.8603e+12 8.1886e+14 425020
## + sqft_lot 1 1.9265e+12 8.2079e+14 425061
## + condition 1 1.5979e+12 8.2112e+14 425068
## + renovated 1 7.6988e+11 8.2195e+14 425085
## + sqft_living15 1 6.2102e+11 8.2210e+14 425089
## <none> 8.2272e+14 425100
## - view 1 1.6942e+13 8.3966e+14 425450
## - waterfront 1 4.0818e+13 8.6354e+14 425935
## - yr_built 1 1.1141e+14 9.3412e+14 427293
## - grade 1 1.5190e+14 9.7462e+14 428027
## - sqft_living 1 1.6005e+14 9.8276e+14 428171
##
## Step: AIC=424894
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms
##
## Df Sum of Sq RSS AIC
## + bathrooms 1 1.0157e+13 8.0274e+14 424679
## + sqft_lot15 1 5.2950e+12 8.0760e+14 424783
## + floors 1 4.2569e+12 8.0864e+14 424805
## + sqft_lot 1 2.8889e+12 8.1001e+14 424834
## + condition 1 2.1432e+12 8.1076e+14 424850
## + renovated 1 7.1013e+11 8.1219e+14 424881
## + sqft_living15 1 4.9966e+11 8.1240e+14 424885
## <none> 8.1290e+14 424894
## - bedrooms 1 9.8179e+12 8.2272e+14 425100
## - view 1 1.4821e+13 8.2772e+14 425204
## - waterfront 1 3.8767e+13 8.5167e+14 425698
## - yr_built 1 1.1213e+14 9.2503e+14 427126
## - grade 1 1.3878e+14 9.5168e+14 427617
## - sqft_living 1 1.5617e+14 9.6907e+14 427930
##
## Step: AIC=424678.6
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms
##
## Df Sum of Sq RSS AIC
## + sqft_lot15 1 4.7365e+12 7.9801e+14 424578
## + floors 1 3.2782e+12 7.9946e+14 424610
## + sqft_lot 1 2.6152e+12 8.0013e+14 424624
## + condition 1 1.7809e+12 8.0096e+14 424642
## + sqft_living15 1 1.3301e+12 8.0141e+14 424652
## + renovated 1 2.6833e+11 8.0247e+14 424675
## <none> 8.0274e+14 424679
## - bathrooms 1 1.0157e+13 8.1290e+14 424894
## - bedrooms 1 1.3260e+13 8.1600e+14 424960
## - view 1 1.3572e+13 8.1631e+14 424967
## - waterfront 1 3.9085e+13 8.4183e+14 425499
## - sqft_living 1 1.1018e+14 9.1292e+14 426900
## - yr_built 1 1.2138e+14 9.2412e+14 427111
## - grade 1 1.3191e+14 9.3465e+14 427307
##
## Step: AIC=424578.3
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms + sqft_lot15
##
## Df Sum of Sq RSS AIC
## + floors 1 2.6691e+12 7.9534e+14 424522
## + condition 1 1.8839e+12 7.9612e+14 424539
## + sqft_living15 1 1.7001e+12 7.9631e+14 424543
## + renovated 1 2.8838e+11 7.9772e+14 424574
## <none> 7.9801e+14 424578
## + sqft_lot 1 1.7180e+09 7.9800e+14 424580
## - sqft_lot15 1 4.7365e+12 8.0274e+14 424679
## - bathrooms 1 9.5989e+12 8.0760e+14 424783
## - view 1 1.3819e+13 8.1182e+14 424873
## - bedrooms 1 1.4684e+13 8.1269e+14 424892
## - waterfront 1 3.9104e+13 8.3711e+14 425403
## - sqft_living 1 1.1487e+14 9.1287e+14 426902
## - yr_built 1 1.1940e+14 9.1740e+14 426987
## - grade 1 1.2849e+14 9.2649e+14 427158
##
## Step: AIC=424522.4
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms + sqft_lot15 + floors
##
## Df Sum of Sq RSS AIC
## + condition 1 2.5106e+12 7.9283e+14 424470
## + sqft_living15 1 2.1413e+12 7.9319e+14 424478
## + renovated 1 1.5045e+11 7.9519e+14 424521
## <none> 7.9534e+14 424522
## + sqft_lot 1 1.3285e+08 7.9534e+14 424524
## - floors 1 2.6691e+12 7.9801e+14 424578
## - sqft_lot15 1 4.1274e+12 7.9946e+14 424610
## - bathrooms 1 8.7689e+12 8.0410e+14 424710
## - view 1 1.4339e+13 8.0968e+14 424829
## - bedrooms 1 1.4489e+13 8.0982e+14 424833
## - waterfront 1 3.8544e+13 8.3388e+14 425339
## - sqft_living 1 1.1435e+14 9.0969e+14 426843
## - grade 1 1.1700e+14 9.1233e+14 426893
## - yr_built 1 1.1745e+14 9.1279e+14 426902
##
## Step: AIC=424469.7
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms + sqft_lot15 + floors + condition
##
## Df Sum of Sq RSS AIC
## + sqft_living15 1 2.2608e+12 7.9056e+14 424422
## + renovated 1 4.1564e+11 7.9241e+14 424463
## <none> 7.9283e+14 424470
## + sqft_lot 1 1.0595e+08 7.9283e+14 424472
## - condition 1 2.5106e+12 7.9534e+14 424522
## - floors 1 3.2958e+12 7.9612e+14 424539
## - sqft_lot15 1 4.1731e+12 7.9700e+14 424558
## - bathrooms 1 8.2704e+12 8.0110e+14 424647
## - view 1 1.4194e+13 8.0702e+14 424775
## - bedrooms 1 1.5113e+13 8.0794e+14 424794
## - waterfront 1 3.8519e+13 8.3134e+14 425288
## - yr_built 1 9.9806e+13 8.9263e+14 426518
## - sqft_living 1 1.1395e+14 9.0677e+14 426790
## - grade 1 1.1745e+14 9.1027e+14 426856
##
## Step: AIC=424422.3
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms + sqft_lot15 + floors + condition +
## sqft_living15
##
## Df Sum of Sq RSS AIC
## + renovated 1 4.9482e+11 7.9007e+14 424414
## <none> 7.9056e+14 424422
## + sqft_lot 1 4.5084e+09 7.9056e+14 424424
## - sqft_living15 1 2.2608e+12 7.9283e+14 424470
## - condition 1 2.6301e+12 7.9319e+14 424478
## - floors 1 3.8081e+12 7.9437e+14 424503
## - sqft_lot15 1 4.5355e+12 7.9510e+14 424519
## - bathrooms 1 9.2553e+12 7.9982e+14 424622
## - view 1 1.2869e+13 8.0343e+14 424700
## - bedrooms 1 1.5167e+13 8.0573e+14 424749
## - waterfront 1 3.9132e+13 8.2970e+14 425256
## - sqft_living 1 8.2208e+13 8.7277e+14 426131
## - grade 1 9.6291e+13 8.8686e+14 426408
## - yr_built 1 1.0173e+14 8.9229e+14 426513
##
## Step: AIC=424413.5
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms + sqft_lot15 + floors + condition +
## sqft_living15 + renovated
##
## Df Sum of Sq RSS AIC
## <none> 7.9007e+14 424414
## + sqft_lot 1 4.8683e+09 7.9006e+14 424415
## - renovated 1 4.9482e+11 7.9056e+14 424422
## - sqft_living15 1 2.3400e+12 7.9241e+14 424463
## - condition 1 2.9329e+12 7.9300e+14 424476
## - floors 1 3.6017e+12 7.9367e+14 424490
## - sqft_lot15 1 4.5943e+12 7.9466e+14 424512
## - bathrooms 1 8.7444e+12 7.9881e+14 424602
## - view 1 1.2706e+13 8.0278e+14 424687
## - bedrooms 1 1.5062e+13 8.0513e+14 424738
## - waterfront 1 3.8512e+13 8.2858e+14 425234
## - sqft_living 1 8.1831e+13 8.7190e+14 426116
## - yr_built 1 8.9033e+13 8.7910e+14 426258
## - grade 1 9.6056e+13 8.8613e+14 426395
##
## Call:
## lm(formula = price ~ sqft_living + view + grade + yr_built +
## waterfront + bedrooms + bathrooms + sqft_lot15 + floors +
## condition + sqft_living15 + renovated, data = train)
##
## Coefficients:
## (Intercept) sqft_living view grade yr_built
## -6.143e+05 1.635e+02 4.105e+04 1.138e+05 -6.491e+04
## waterfront bedrooms bathrooms sqft_lot15 floors
## 6.004e+05 -4.016e+04 4.429e+04 -6.069e-01 3.309e+04
## condition sqft_living15 renovated
## 2.183e+04 2.810e+01 2.843e+04
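The AIC column printed by `step()` for a linear model comes from `extractAIC()`, which computes n * log(RSS / n) + 2 * p, where p is the number of estimated coefficients. Plugging in the null model above (n = 17,290 training rows, RSS = 2.2882e+15, p = 1) gives approximately 442,776, in line with the starting AIC. A minimal sketch of the same calculation, using the built-in `mtcars` data as a stand-in for `train`:

```r
# Stand-in model on built-in data (not the housing data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
p   <- length(coef(fit))  # intercept plus two slopes

# AIC as used by step(): n * log(RSS / n) + 2 * p
aic_by_hand <- n * log(rss / n) + 2 * p

# extractAIC() returns c(edf, AIC); the second element is the criterion
aic_step <- extractAIC(fit)[2]

c(by_hand = aic_by_hand, extractAIC = aic_step)
```

This value differs from `AIC(fit)` by an additive constant that depends only on n, so the ranking of models fitted to the same data is unaffected.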
< Forward Selection >
step(regnull, scope=list(lower=regnull, upper=regfull), direction="forward")
## Start: AIC=442775.7
## price ~ 1
##
## Df Sum of Sq RSS AIC
## + sqft_living 1 1.1329e+15 1.1553e+15 430962
## + grade 1 1.0242e+15 1.2640e+15 432516
## + sqft_living15 1 7.8905e+14 1.4992e+15 435466
## + bathrooms 1 6.0414e+14 1.6841e+15 437477
## + view 1 3.5639e+14 1.9318e+15 439850
## + bedrooms 1 2.0920e+14 2.0790e+15 441120
## + waterfront 1 1.7097e+14 2.1172e+15 441435
## + floors 1 1.6137e+14 2.1268e+15 441513
## + renovated 1 3.5656e+13 2.2525e+15 442506
## + sqft_lot 1 2.0037e+13 2.2682e+15 442626
## + sqft_lot15 1 1.5906e+13 2.2723e+15 442657
## + yr_built 1 5.6435e+12 2.2826e+15 442735
## + condition 1 3.4753e+12 2.2847e+15 442751
## <none> 2.2882e+15 442776
##
## Step: AIC=430962
## price ~ sqft_living
##
## Df Sum of Sq RSS AIC
## + view 1 9.6282e+13 1.0590e+15 429459
## + grade 1 9.6101e+13 1.0592e+15 429462
## + waterfront 1 9.0018e+13 1.0653e+15 429561
## + yr_built 1 6.8189e+13 1.0871e+15 429912
## + bedrooms 1 3.3062e+13 1.1223e+15 430462
## + renovated 1 1.6775e+13 1.1386e+15 430711
## + sqft_living15 1 1.6529e+13 1.1388e+15 430715
## + condition 1 1.3494e+13 1.1418e+15 430761
## + sqft_lot15 1 6.0106e+12 1.1493e+15 430874
## + sqft_lot 1 3.2768e+12 1.1520e+15 430915
## + bathrooms 1 2.3654e+12 1.1530e+15 430929
## + floors 1 3.1999e+11 1.1550e+15 430959
## <none> 1.1553e+15 430962
##
## Step: AIC=429459.5
## price ~ sqft_living + view
##
## Df Sum of Sq RSS AIC
## + grade 1 8.4847e+13 9.7420e+14 428018
## + yr_built 1 4.6244e+13 1.0128e+15 428690
## + waterfront 1 3.8196e+13 1.0208e+15 428826
## + bedrooms 1 2.1640e+13 1.0374e+15 429105
## + renovated 1 1.0244e+13 1.0488e+15 429293
## + condition 1 9.3046e+12 1.0497e+15 429309
## + sqft_living15 1 9.2323e+12 1.0498e+15 429310
## + sqft_lot15 1 6.9146e+12 1.0521e+15 429348
## + sqft_lot 1 3.9236e+12 1.0551e+15 429397
## + bathrooms 1 2.3140e+12 1.0567e+15 429424
## + floors 1 1.7448e+12 1.0573e+15 429433
## <none> 1.0590e+15 429459
##
## Step: AIC=428017.6
## price ~ sqft_living + view + grade
##
## Df Sum of Sq RSS AIC
## + yr_built 1 1.1066e+14 8.6354e+14 425935
## + waterfront 1 4.0072e+13 9.3412e+14 427293
## + condition 1 2.0864e+13 9.5333e+14 427645
## + renovated 1 1.3455e+13 9.6074e+14 427779
## + bedrooms 1 1.1057e+13 9.6314e+14 427822
## + sqft_lot15 1 5.0887e+12 9.6911e+14 427929
## + floors 1 2.7444e+12 9.7145e+14 427971
## + sqft_lot 1 2.6566e+12 9.7154e+14 427972
## <none> 9.7420e+14 428018
## + bathrooms 1 1.1035e+11 9.7408e+14 428018
## + sqft_living15 1 1.4925e+09 9.7419e+14 428020
##
## Step: AIC=425934.8
## price ~ sqft_living + view + grade + yr_built
##
## Df Sum of Sq RSS AIC
## + waterfront 1 4.0818e+13 8.2272e+14 425100
## + bedrooms 1 1.1869e+13 8.5167e+14 425698
## + bathrooms 1 6.1747e+12 8.5736e+14 425813
## + floors 1 4.8417e+12 8.5869e+14 425840
## + sqft_lot15 1 3.7083e+12 8.5983e+14 425862
## + sqft_lot 1 2.1378e+12 8.6140e+14 425894
## + renovated 1 1.5630e+12 8.6197e+14 425906
## + condition 1 1.5122e+12 8.6202e+14 425907
## + sqft_living15 1 3.1983e+11 8.6322e+14 425930
## <none> 8.6354e+14 425935
##
## Step: AIC=425099.6
## price ~ sqft_living + view + grade + yr_built + waterfront
##
## Df Sum of Sq RSS AIC
## + bedrooms 1 9.8179e+12 8.1290e+14 424894
## + bathrooms 1 6.7148e+12 8.1600e+14 424960
## + floors 1 4.1674e+12 8.1855e+14 425014
## + sqft_lot15 1 3.8603e+12 8.1886e+14 425020
## + sqft_lot 1 1.9265e+12 8.2079e+14 425061
## + condition 1 1.5979e+12 8.2112e+14 425068
## + renovated 1 7.6988e+11 8.2195e+14 425085
## + sqft_living15 1 6.2102e+11 8.2210e+14 425089
## <none> 8.2272e+14 425100
##
## Step: AIC=424894
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms
##
## Df Sum of Sq RSS AIC
## + bathrooms 1 1.0157e+13 8.0274e+14 424679
## + sqft_lot15 1 5.2950e+12 8.0760e+14 424783
## + floors 1 4.2569e+12 8.0864e+14 424805
## + sqft_lot 1 2.8889e+12 8.1001e+14 424834
## + condition 1 2.1432e+12 8.1076e+14 424850
## + renovated 1 7.1013e+11 8.1219e+14 424881
## + sqft_living15 1 4.9966e+11 8.1240e+14 424885
## <none> 8.1290e+14 424894
##
## Step: AIC=424678.6
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms
##
## Df Sum of Sq RSS AIC
## + sqft_lot15 1 4.7365e+12 7.9801e+14 424578
## + floors 1 3.2782e+12 7.9946e+14 424610
## + sqft_lot 1 2.6152e+12 8.0013e+14 424624
## + condition 1 1.7809e+12 8.0096e+14 424642
## + sqft_living15 1 1.3301e+12 8.0141e+14 424652
## + renovated 1 2.6833e+11 8.0247e+14 424675
## <none> 8.0274e+14 424679
##
## Step: AIC=424578.3
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms + sqft_lot15
##
## Df Sum of Sq RSS AIC
## + floors 1 2.6691e+12 7.9534e+14 424522
## + condition 1 1.8839e+12 7.9612e+14 424539
## + sqft_living15 1 1.7001e+12 7.9631e+14 424543
## + renovated 1 2.8838e+11 7.9772e+14 424574
## <none> 7.9801e+14 424578
## + sqft_lot 1 1.7180e+09 7.9800e+14 424580
##
## Step: AIC=424522.4
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms + sqft_lot15 + floors
##
## Df Sum of Sq RSS AIC
## + condition 1 2.5106e+12 7.9283e+14 424470
## + sqft_living15 1 2.1413e+12 7.9319e+14 424478
## + renovated 1 1.5045e+11 7.9519e+14 424521
## <none> 7.9534e+14 424522
## + sqft_lot 1 1.3285e+08 7.9534e+14 424524
##
## Step: AIC=424469.7
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms + sqft_lot15 + floors + condition
##
## Df Sum of Sq RSS AIC
## + sqft_living15 1 2.2608e+12 7.9056e+14 424422
## + renovated 1 4.1564e+11 7.9241e+14 424463
## <none> 7.9283e+14 424470
## + sqft_lot 1 1.0595e+08 7.9283e+14 424472
##
## Step: AIC=424422.3
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms + sqft_lot15 + floors + condition +
## sqft_living15
##
## Df Sum of Sq RSS AIC
## + renovated 1 4.9482e+11 7.9007e+14 424414
## <none> 7.9056e+14 424422
## + sqft_lot 1 4.5084e+09 7.9056e+14 424424
##
## Step: AIC=424413.5
## price ~ sqft_living + view + grade + yr_built + waterfront +
## bedrooms + bathrooms + sqft_lot15 + floors + condition +
## sqft_living15 + renovated
##
## Df Sum of Sq RSS AIC
## <none> 7.9007e+14 424414
## + sqft_lot 1 4868282009 7.9006e+14 424415
##
## Call:
## lm(formula = price ~ sqft_living + view + grade + yr_built +
## waterfront + bedrooms + bathrooms + sqft_lot15 + floors +
## condition + sqft_living15 + renovated, data = train)
##
## Coefficients:
## (Intercept) sqft_living view grade yr_built
## -6.143e+05 1.635e+02 4.105e+04 1.138e+05 -6.491e+04
## waterfront bedrooms bathrooms sqft_lot15 floors
## 6.004e+05 -4.016e+04 4.429e+04 -6.069e-01 3.309e+04
## condition sqft_living15 renovated
## 2.183e+04 2.810e+01 2.843e+04
< Backward Elimination >
step(regfull, scope=list(lower=regnull, upper=regfull), direction="backward")
## Start: AIC=424415.4
## price ~ bedrooms + bathrooms + floors + waterfront + view + condition +
## grade + yr_built + sqft_living + sqft_lot + sqft_living15 +
## sqft_lot15 + renovated
##
## Df Sum of Sq RSS AIC
## - sqft_lot 1 4.8683e+09 7.9007e+14 424414
## <none> 7.9006e+14 424415
## - renovated 1 4.9518e+11 7.9056e+14 424424
## - sqft_lot15 1 2.3061e+12 7.9237e+14 424464
## - sqft_living15 1 2.3447e+12 7.9241e+14 424465
## - condition 1 2.9358e+12 7.9300e+14 424478
## - floors 1 3.6059e+12 7.9367e+14 424492
## - bathrooms 1 8.7436e+12 7.9881e+14 424604
## - view 1 1.2697e+13 8.0276e+14 424689
## - bedrooms 1 1.5031e+13 8.0510e+14 424739
## - waterfront 1 3.8509e+13 8.2857e+14 425236
## - sqft_living 1 8.1390e+13 8.7146e+14 426109
## - yr_built 1 8.9023e+13 8.7909e+14 426259
## - grade 1 9.6051e+13 8.8612e+14 426397
##
## Step: AIC=424413.5
## price ~ bedrooms + bathrooms + floors + waterfront + view + condition +
## grade + yr_built + sqft_living + sqft_living15 + sqft_lot15 +
## renovated
##
## Df Sum of Sq RSS AIC
## <none> 7.9007e+14 424414
## - renovated 1 4.9482e+11 7.9056e+14 424422
## - sqft_living15 1 2.3400e+12 7.9241e+14 424463
## - condition 1 2.9329e+12 7.9300e+14 424476
## - floors 1 3.6017e+12 7.9367e+14 424490
## - sqft_lot15 1 4.5943e+12 7.9466e+14 424512
## - bathrooms 1 8.7444e+12 7.9881e+14 424602
## - view 1 1.2706e+13 8.0278e+14 424687
## - bedrooms 1 1.5062e+13 8.0513e+14 424738
## - waterfront 1 3.8512e+13 8.2858e+14 425234
## - sqft_living 1 8.1831e+13 8.7190e+14 426116
## - yr_built 1 8.9033e+13 8.7910e+14 426258
## - grade 1 9.6056e+13 8.8613e+14 426395
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront +
## view + condition + grade + yr_built + sqft_living + sqft_living15 +
## sqft_lot15 + renovated, data = train)
##
## Coefficients:
## (Intercept) bedrooms bathrooms floors waterfront
## -6.143e+05 -4.016e+04 4.429e+04 3.309e+04 6.004e+05
## view condition grade yr_built sqft_living
## 4.105e+04 2.183e+04 1.138e+05 -6.491e+04 1.635e+02
## sqft_living15 sqft_lot15 renovated
## 2.810e+01 -6.069e-01 2.843e+04
As a result, all three procedures arrived at the same model, which includes every predictor except sqft_lot. We can use the remaining predictors to build our model.
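Rather than comparing the three printouts by eye, the selected predictor sets can be compared programmatically. This is a sketch under stated assumptions: it uses the built-in `mtcars` data and an arbitrary set of four candidate predictors as stand-ins for `train`, and the three procedures need not agree on other data.

```r
# Stand-in data: built-in mtcars, with mpg as the response
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
scope    <- list(lower = ~ 1, upper = ~ wt + hp + qsec + drat)

# trace = 0 suppresses the step-by-step printout
m_both <- step(null_fit, scope = scope, direction = "both",     trace = 0)
m_fwd  <- step(null_fit, scope = scope, direction = "forward",  trace = 0)
m_bwd  <- step(full_fit, scope = scope, direction = "backward", trace = 0)

# Term labels of each final model, sorted so that order does not matter
selected <- lapply(list(both = m_both, forward = m_fwd, backward = m_bwd),
                   function(m) sort(attr(terms(m), "term.labels")))
selected

# TRUE only if all three procedures ended on the same predictor set
all_agree <- length(unique(selected)) == 1
all_agree
```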
< Linear Regression >
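Using the lm function, let’s fit our first model. The PRESS helper defined first is used further down to compute the predicted R-squared. According to the summary, every predictor is statistically significant. What is counterintuitive is that the coefficients of bedrooms and sqft_lot15 are negative (the coefficient of the binned yr_built is negative as well).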
PRESS <- function(linear.model) {
## get the residuals from the linear.model
## extract hat from lm.influence to obtain the leverages
pr <- residuals(linear.model) / (1-lm.influence(linear.model)$hat)
## calculate the PRESS by squaring each term and adding them up
PRESS <- sum(pr ^ 2)
return(PRESS)
}
result <- lm(price ~ bedrooms + bathrooms + floors + waterfront + view + condition + grade + yr_built + sqft_living +
sqft_living15 + sqft_lot15 + renovated, data = train)
summary(result)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront +
## view + condition + grade + yr_built + sqft_living + sqft_living15 +
## sqft_lot15 + renovated, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1231890 -111153 -8250 91144 4324636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.143e+05 1.798e+04 -34.169 < 2e-16 ***
## bedrooms -4.016e+04 2.213e+03 -18.149 < 2e-16 ***
## bathrooms 4.429e+04 3.203e+03 13.828 < 2e-16 ***
## floors 3.309e+04 3.729e+03 8.875 < 2e-16 ***
## waterfront 6.004e+05 2.069e+04 29.020 < 2e-16 ***
## view 4.105e+04 2.462e+03 16.669 < 2e-16 ***
## condition 2.183e+04 2.726e+03 8.008 1.24e-15 ***
## grade 1.138e+05 2.482e+03 45.831 < 2e-16 ***
## yr_built -6.491e+04 1.471e+03 -44.124 < 2e-16 ***
## sqft_living 1.635e+02 3.865e+00 42.302 < 2e-16 ***
## sqft_living15 2.810e+01 3.929e+00 7.153 8.81e-13 ***
## sqft_lot15 -6.069e-01 6.055e-02 -10.023 < 2e-16 ***
## renovated 2.843e+04 8.642e+03 3.289 0.00101 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 213800 on 17277 degrees of freedom
## Multiple R-squared: 0.6547, Adjusted R-squared: 0.6545
## F-statistic: 2730 on 12 and 17277 DF, p-value: < 2.2e-16
test$predict <- round(predict(result, newdata = test))
test_mse_ln <- mean((test$price - test$predict)^2)
test_mse_ln
## [1] 52482587023
summary(result)$r.squared
## [1] 0.6547201
summary(result)$adj.r.squared
## [1] 0.6544803
PRESS(result)
## [1] 7.947446e+14
##Find SST
anova_result<-anova(result)
SST<-sum(anova_result$"Sum Sq")
##R2 pred
Rsq_pred <- 1-PRESS(result)/SST
Rsq_pred
## [1] 0.6526771
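The PRESS helper divides each residual by one minus its leverage instead of refitting the model n times. A minimal sketch, assuming the built-in mtcars data as a stand-in for the housing data, checks that this shortcut agrees with explicit leave-one-out refits; fit, press_shortcut, press_loo, and rsq_pred are hypothetical names.

```r
# Illustrative only: mtcars stands in for the housing data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shortcut: leave-one-out residual = ordinary residual / (1 - leverage)
press_shortcut <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Brute force: refit without row i, then predict row i
loo_err <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
})
press_loo <- sum(loo_err^2)

# Predicted R-squared: 1 - PRESS / SST
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
rsq_pred <- 1 - press_shortcut / sst
```

Because PRESS is never smaller than the residual sum of squares, the predicted R-squared sits at or below the ordinary R-squared, which is the pattern in the output above.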
According to the VIFs, all values are below the common threshold of 10, so there is no sign of severe multicollinearity in our model.
vif(result)
## bedrooms bathrooms floors waterfront view
## 1.607513 2.207145 1.532359 1.198319 1.353198
## condition grade yr_built sqft_living sqft_living15
## 1.209302 3.208107 1.793612 4.731896 2.740061
## sqft_lot15 renovated
## 1.065314 1.126851
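For intuition about what a VIF measures, the sketch below computes it by hand as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. It uses mtcars and three arbitrary predictors as assumptions; for a model with only numeric predictors, the values should match what vif() reports.

```r
# Illustrative only: mtcars stands in for the housing data.
predictors <- c("wt", "hp", "disp")

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

# Equivalent: diagonal of the inverse correlation matrix of the predictors
vif_corr <- diag(solve(cor(mtcars[, predictors])))
```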
Let’s look at the residual plot to check the regression assumptions. According to the first plot, the second assumption, constant variance, appears to be violated: the variance of the residuals gets larger as the fitted y gets larger. Therefore, we should transform y to address this issue.
yhat <- result$fitted.values
res <- result$residuals
Data <- data.frame(train, yhat, res)
ggplot(Data, aes(x=yhat,y=res))+
geom_point()+
geom_hline(yintercept=0, color="red")+
labs(x="Fitted y",
y="Residuals",
title="Residual Plot")
acf(res)
qqnorm(res)
qqline(res, col="red")
The Box-Cox method is an analytical way to decide how to transform the response variable to achieve constant variance. According to the plot, the optimal \(\lambda\) is 0.1.
boxcox(result,lambda = seq(-1.,1,0.5))
Therefore, let’s transform the response as \(y^{*} = y^{0.1}\).
train <- train %>% mutate(price = price ^ 0.1)
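Instead of reading the optimal \(\lambda\) off the plot, it can also be pulled out numerically. The following is a minimal sketch on simulated data, not the housing data, assuming the boxcox function from the MASS package; d, lambda_hat, in_ci, and lambda_ci are hypothetical names.

```r
library(MASS)

# Illustrative only: simulated positive response, not the housing data.
set.seed(1)
d <- data.frame(x = runif(200, 1, 10))
d$y <- exp(0.3 * d$x + rnorm(200, sd = 0.2))   # linear on the log scale

fit <- lm(y ~ x, data = d, y = TRUE, qr = TRUE)

# plotit = FALSE returns the lambda grid (x) and the profile log-likelihood (y)
bc <- boxcox(fit, lambda = seq(-1, 1, 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
in_ci <- bc$y > max(bc$y) - qchisq(0.95, 1) / 2
lambda_ci <- range(bc$x[in_ci])
```

A fine lambda grid matters here: with a coarse grid such as steps of 0.5, the maximizer can only land on a grid point unless the plot interpolates.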
After the y-transformation, our model performs better: the test MSE on the original price scale falls from about 5.25e10 to about 4.39e10.
result2 <- lm(price ~ bedrooms + bathrooms + floors + waterfront + view + condition + grade + yr_built + sqft_living +
sqft_living15 + sqft_lot15 + renovated, data = train)
summary(result2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront +
## view + condition + grade + yr_built + sqft_living + sqft_living15 +
## sqft_lot15 + renovated, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58082 -0.07908 0.00364 0.07785 0.47712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.911e+00 9.660e-03 301.295 < 2e-16 ***
## bedrooms -8.982e-03 1.189e-03 -7.555 4.41e-14 ***
## bathrooms 2.592e-02 1.721e-03 15.062 < 2e-16 ***
## floors 3.859e-02 2.004e-03 19.263 < 2e-16 ***
## waterfront 1.443e-01 1.112e-02 12.981 < 2e-16 ***
## view 1.826e-02 1.323e-03 13.802 < 2e-16 ***
## condition 1.828e-02 1.465e-03 12.479 < 2e-16 ***
## grade 7.377e-02 1.334e-03 55.312 < 2e-16 ***
## yr_built -3.756e-02 7.904e-04 -47.518 < 2e-16 ***
## sqft_living 5.421e-05 2.077e-06 26.104 < 2e-16 ***
## sqft_living15 3.686e-05 2.111e-06 17.463 < 2e-16 ***
## sqft_lot15 -1.960e-07 3.254e-08 -6.025 1.73e-09 ***
## renovated 1.466e-02 4.643e-03 3.158 0.00159 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1149 on 17277 degrees of freedom
## Multiple R-squared: 0.6585, Adjusted R-squared: 0.6583
## F-statistic: 2776 on 12 and 17277 DF, p-value: < 2.2e-16
test$predict <- round(predict(result2, newdata = test) ^ 10)
test_mse_ln_2 <- mean((test$price - test$predict)^2)
test_mse_ln_2
## [1] 43928749327
summary(result2)$r.squared
## [1] 0.658523
summary(result2)$adj.r.squared
## [1] 0.6582858
PRESS(result2)
## [1] 228.5181
##Find SST
anova_result<-anova(result2)
SST<-sum(anova_result$"Sum Sq")
##R2 pred
Rsq_pred <- 1-PRESS(result2)/SST
Rsq_pred
## [1] 0.6578727
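The test predictions above are raised to the 10th power because the model is fit on \(y^{0.1}\). A minimal sketch of the same back-transformation on simulated data follows; lambda, d, train_d, test_d, and pred_price are hypothetical names, and the transformation is written inside the formula as an alternative to overwriting the price column.

```r
# Illustrative only: simulated data with hypothetical names.
set.seed(1)
lambda <- 0.1
d <- data.frame(x = runif(500, 1, 10))
d$price <- (3 + 0.1 * d$x + rnorm(500, sd = 0.05))^(1 / lambda)

idx <- sample.int(nrow(d), 400)
train_d <- d[idx, ]
test_d  <- d[-idx, ]

# Fit on the transformed scale without overwriting the original column
fit <- lm(I(price^lambda) ~ x, data = train_d)

# Predictions come back on the y^lambda scale; invert before comparing to raw prices
pred_price <- predict(fit, newdata = test_d)^(1 / lambda)
test_mse <- mean((test_d$price - pred_price)^2)
```

Keeping lambda in one variable means the exponent used for fitting and the inverse used for prediction cannot drift apart.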
According to the residual plot, ACF plot, and normal probability plot, all of the regression assumptions are satisfied.
yhat<-result2$fitted.values
res<-result2$residuals
Data<-data.frame(train,yhat,res)
ggplot(Data, aes(x=yhat,y=res))+
geom_point()+
geom_hline(yintercept=0, color="red")+
labs(x="Fitted y",
y="Residuals",
title="Residual Plot")
acf(res)
qqnorm(res)
qqline(res, col="red")
tail(test)
## price bedrooms bathrooms floors waterfront view condition grade
## 21585 380000 3 2 2 0 0 3 7
## 21592 572000 4 3 2 0 0 3 8
## 21594 1088000 5 4 2 0 2 3 10
## 21599 541800 4 2 2 0 2 3 9
## 21602 467000 3 2 3 0 0 3 8
## 21607 1007500 4 4 2 0 0 3 9
## yr_built sqft_living sqft_lot sqft_living15 sqft_lot15 renovated predict
## 21585 5 1260 900 1310 1415 0 285799
## 21592 5 2770 3852 1810 5641 0 484241
## 21594 5 4170 8142 3030 7980 0 1113250
## 21599 5 3118 7866 2673 6500 0 692559
## 21602 5 1425 1179 1285 1253 0 400368
## 21607 5 3510 7200 2050 6200 0 717624
In order to improve the performance of our model, let’s identify outliers, high leverage points, and influential points in the dataset.
By checking the standardized, studentized, and externally studentized residuals, we can identify outliers in our dataset. An observation is flagged when the absolute value of its externally studentized residual exceeds the critical value. With the critical value computed below, no observation is flagged as an outlier.
n <- nrow(train)
p <- 13
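# Bonferroni critical value: t quantile at 1 - 0.05/(2n) with n-1-p degrees of freedom.
# The output shown below came from qt(1-0.05,(2*n), n-1-p), which passes 2n as the degrees
# of freedom and n-1-p as a noncentrality parameter, so rerun this chunk to refresh it.
cv <- qt(1-0.05/(2*n), n-1-p)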
res <- result2$residuals
standard.res<- res/summary(result2)$sigma
student.res <- rstandard(result2)
ext.student.res <- rstudent(result2)
ext.student.res[abs(ext.student.res)>cv]
## named numeric(0)
res.frame<-data.frame(res,standard.res,
student.res,ext.student.res)
par(mfrow=c(1,3))
plot(result2$fitted.values,standard.res,
main="Standardized Residuals",
ylim=c(-4.5,4.5))
plot(result2$fitted.values,student.res,
main="Studentized Residuals",
ylim=c(-4.5,4.5))
plot(result2$fitted.values,ext.student.res,
main="Externally Studentized Residuals",
ylim=c(-4.5,4.5))
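The usual outlier rule compares each externally studentized residual with a two-sided Bonferroni critical value, the t quantile at 1 - alpha/(2n) with n - 1 - p degrees of freedom. A minimal sketch on mtcars (an assumption for illustration, with hypothetical names alpha, ext_res, and outliers):

```r
# Illustrative only: mtcars stands in for the housing data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit))          # number of parameters, including the intercept
alpha <- 0.05

# Two-sided Bonferroni cutoff: alpha is split over n tests and two tails
cv <- qt(1 - alpha / (2 * n), df = n - 1 - p)

# Externally studentized residuals beyond the cutoff are flagged as outliers
ext_res <- rstudent(fit)
outliers <- ext_res[abs(ext_res) > cv]
```

Naming the df argument makes it harder to pass the wrong quantity to qt() by position.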
< Leverage >
By checking leverage, we can identify how many observations are far from the average of the predictors. According to the result, a total of 1604 observations (about 9.3%) have leverage above 2p/n, that is, they are far from the centroid of the predictor space. These high leverage points are potentially influential observations.
# leverage
lev <- lm.influence(result2)$hat
# identify high leverage points
x <- lev[lev > 2 * p / n]
length(x) / n
## [1] 0.09277039
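The 2p/n cutoff comes from the fact that the leverages sum to p, so their average is p/n and the rule flags points with more than twice the average. A minimal sketch on mtcars, used here only as a stand-in, with hypothetical names avg_lev, high_lev, and share_high:

```r
# Illustrative only: mtcars stands in for the housing data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit))

lev <- hatvalues(fit)           # same values as lm.influence(fit)$hat

# Leverages sum to p, so the average is p / n; flag points above twice the average
avg_lev <- mean(lev)
high_lev <- lev[lev > 2 * p / n]
share_high <- length(high_lev) / n
```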
< Detecting Influential Observations >
After finding the observations that are outlying or have high leverage, the next step is to ascertain whether these observations are influential. Measures of influence address how much the estimates (fitted values, coefficients, etc.) would change if an observation were deleted. By looking at Cook’s distance, DFFITS (difference in fits), and DFBETAS (difference in betas), we can detect influential observations.
The result from Cook’s distance shows no sign of influential observations.
# COOK's Distance
COOKS<-cooks.distance(result2)
y <- COOKS[COOKS>qf(0.5,p,n-p)]
length(y)
## [1] 0
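Cook’s distance combines the size of a residual with its leverage, which is why a point needs both to be flagged. The sketch below, on mtcars as an assumed stand-in, rebuilds the statistic from the internally studentized residuals and the leverages and compares it with cooks.distance(); cooks_manual, cooks_builtin, and influential are hypothetical names.

```r
# Illustrative only: mtcars stands in for the housing data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit))

h <- hatvalues(fit)
r <- rstandard(fit)             # internally studentized residuals

# D_i = r_i^2 * h_i / (p * (1 - h_i))
cooks_manual <- r^2 * h / (p * (1 - h))
cooks_builtin <- cooks.distance(fit)

# Same rule as above: flag D_i above the median of F(p, n - p)
cutoff <- qf(0.5, p, n - p)
influential <- cooks_builtin[cooks_builtin > cutoff]
```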
However, DFFITS does flag influential points. The output below lists the row index of each flagged observation together with its DFFITS value.
# DFFITS (Difference in Fits)
DFFITS <- dffits(result2)
z <- (DFFITS[abs(DFFITS)>2*sqrt(p/n)])
z
## 18183 20728 18175 13284 19485 13973
## 0.07108293 0.06319248 0.05902664 -0.05717550 0.10840119 -0.08076250
## 8547 5936 719 5136 9086 14837
## -0.09930367 -0.10606113 0.06050325 -0.05954569 -0.14109147 -0.07918745
## 12669 5388 4590 3616 12339 2137
## -0.07959627 -0.10368078 0.05747444 -0.08701908 -0.05614251 0.05791141
## 10141 8706 10994 3760 11014 116
## -0.05546738 0.05800459 -0.06496679 -0.06195006 0.07061236 -0.08237987
## 313 3157 15135 10841 14276 11447
## 0.08174423 0.08015213 -0.16377244 0.07990202 0.12474943 0.10993350
## 17871 7379 3363 18004 11600 18520
## -0.05853706 -0.06872001 0.06765316 0.05800427 -0.05989450 0.05573200
## 12277 20594 2506 18334 4777 2267
## 0.06004510 0.07440618 -0.06879759 -0.07068597 -0.06192556 -0.05904749
## 217 7987 1654 9885 4630 16940
## 0.06321358 0.08574407 -0.10550317 -0.06103064 0.06503921 -0.05673738
## 12552 19149 11471 16889 4475 3375
## -0.05940134 0.11966224 0.05517153 -0.06736042 -0.06261089 0.05686820
## 15071 3158 4527 4612 6963 17143
## -0.07444605 -0.07593816 0.06007790 -0.10863439 0.05908704 0.07940221
## 10386 2182 13816 9453 4086 18595
## -0.08426683 0.07208450 -0.06449253 0.06522075 0.06137966 -0.12722131
## 17349 19650 14451 1362 18793 19150
## -0.06060542 0.05724713 0.05644792 0.06068046 -0.06347476 0.05484574
## 20248 18333 3915 2076 18428 19826
## 0.06919517 -0.22295939 0.14523765 0.14880907 -0.07313483 -0.05703511
## 3229 11559 12827 3768 6305 17658
## -0.09522491 0.05496718 0.10291240 -0.05711788 -0.05586900 -0.14615723
## 14514 12711 14424 2985 1865 8093
## 0.10714687 -0.05502144 -0.11554095 -0.05557590 -0.06105383 -0.11803420
## 16908 2445 5082 8569 13852 283
## -0.06369343 0.06172163 0.05934065 -0.09432188 0.08413628 0.06229435
## 12986 16776 3439 12105 13236 20093
## -0.09410781 0.05624977 0.07046594 0.06213938 0.07633140 0.06305631
## 16378 4550 3726 4442 19298 301
## 0.09433068 -0.09545193 -0.06862263 0.06305899 0.05528995 0.15349075
## 20175 4766 17198 14386 6897 12187
## 0.06116196 0.06429567 -0.16915868 0.17323899 -0.06745552 0.06839312
## 19382 16270 839 17900 10959 17152
## 0.07140647 0.09383761 -0.05652887 0.08598118 0.09773307 0.14946613
## 312 12371 18507 7990 10505 10662
## -0.07336048 0.08499062 -0.06476957 0.16120651 0.07064676 -0.05598395
## 5748 13710 18896 2900 17153 21384
## -0.10499340 -0.09135252 -0.07150133 0.11815055 -0.13409275 0.05926969
## 5828 5257 15669 19472 886 15765
## 0.09149204 -0.06091287 -0.17495360 0.07143248 0.11123039 0.05952294
## 11536 11258 13968 15431 21432 18355
## 0.12620039 0.08724044 0.14716609 0.05713714 0.13908702 0.05583365
## 19157 20387 3224 359 13985 4150
## -0.10757396 0.09532979 0.06818043 0.11549926 0.06617025 0.12054932
## 10981 14222 7295 458 17493 2032
## -0.10667159 0.05560285 0.17822087 -0.12867810 -0.09052462 -0.13154013
## 14191 16681 18427 8540 1295 20825
## 0.08026588 0.05782038 0.05612450 0.06604821 -0.08799128 0.07388799
## 657 4697 12588 17529 18469 5304
## 0.29439907 -0.06247325 0.05618829 -0.11216622 -0.11493031 -0.06632014
## 5792 19481 12725 17549 15163 19453
## 0.06249082 0.07054349 -0.07419790 -0.05523865 0.06383186 0.18368305
## 19175 997 9213 20155 15159 20940
## -0.06949802 -0.05526628 -0.05636474 0.18376486 0.06216365 0.05913841
## 20185 17660 17364 13436 16845 6824
## 0.05601295 0.05552905 0.07698730 -0.06277547 -0.07125624 0.08179609
## 2224 247 3043 17018 7424 16345
## 0.09145185 0.26185334 0.06670999 -0.06458052 0.05983266 -0.05524523
## 6995 21326 6892 13831 18913 11285
## -0.07887286 0.10973710 0.07532461 0.06101163 0.08882648 0.05680292
## 19138 18686 13478 3541 4082 12152
## -0.07207382 0.08964322 0.09463365 -0.06085206 0.05802625 0.06604497
## 8029 9369 18587 18330 13021 9642
## 0.09324214 0.07663521 -0.09396192 -0.06996668 0.10007375 0.08108778
## 11879 1031 12697 18288 14775 2558
## 0.08826250 0.06234696 0.07032536 0.07762940 -0.17375180 -0.07400285
## 12908 17586 14964 519 4923 5789
## -0.07015386 0.06756869 0.05706398 0.07515092 0.08578459 -0.05544796
## 15697 6198 231 6234 70 18022
## -0.07067144 0.06389339 -0.13447681 -0.08730313 0.07522748 0.10218588
## 4191 15328 9622 16536 6809 12150
## 0.05845354 -0.06122496 -0.06964431 -0.09010370 0.08078622 0.06803387
## 20225 20756 10254 541 16253 11832
## 0.06716845 0.07734146 -0.06630145 -0.17382420 0.11061058 -0.10306714
## 3205 17745 1221 8917 17597 3
## -0.12713790 -0.07132727 -0.06612544 0.06411035 0.10183792 -0.06041297
## 6741 16393 415 3934 18456 20097
## 0.09250009 0.08849357 -0.05994076 0.06082247 0.10607392 0.09239186
## 2590 20895 4930 13630 4652 8639
## -0.14969553 0.07050700 -0.05503907 0.09459262 -0.05945616 0.16355438
## 5178 5377 20934 882 4219 770
## 0.05639895 -0.09218005 0.07791587 -0.06188406 -0.08780400 -0.05972490
## 10586 6046 8750 2154 3885 5852
## -0.07195325 0.06241155 -0.05910855 0.05722436 -0.05496736 0.06114941
## 1787 14071 9609 14295 18867 15214
## 0.06988851 0.05976932 0.09815374 -0.07514992 0.07599147 0.05489008
## 9344 1311 21345 16179 21369 18607
## 0.07304106 -0.07974223 -0.18024150 -0.07099211 0.08573644 0.10147200
## 18203 13044 5367 12427 14329 16005
## 0.05742608 -0.05759074 -0.05711293 -0.06094474 0.06175693 0.06851916
## 15495 4924 2618 18002 6403 7232
## -0.07743400 0.08844843 -0.05525030 -0.06487375 0.10954667 0.07041252
## 4860 66 10767 4412 19469 12125
## 0.05900183 -0.06856526 0.07032302 0.12947835 0.05498374 0.11153748
## 16521 9078 7846 19957 1326 7537
## 0.10587766 0.06323873 -0.07747703 0.07155702 -0.05568308 0.06255010
## 13378 18795 5669 3764 11704 3230
## 0.05822239 0.05942163 -0.06124087 0.05863350 -0.06386975 0.07751080
## 7934 10361 4241 5881 16707 18276
## 0.07638054 -0.11712218 0.06437378 0.11281363 -0.06437941 -0.16233349
## 17954 1434 14804 3537 8708 20021
## -0.05593713 0.05670237 0.05646210 0.13035223 0.05829097 0.05559131
## 18227 20624 4025 126 10111 15522
## 0.11689148 0.05821774 -0.33887182 0.08000492 0.06184661 0.05982442
## 20875 9460 3951 20963 8618 7783
## -0.05502156 -0.07772365 -0.07450546 -0.09221637 -0.06452741 0.07079328
## 18513 9295 12568 6458 15841 18096
## 0.06824290 0.05624665 0.05515196 0.07072592 0.11165868 0.05589893
## 6953 11875 10319 4872 17475 4706
## -0.08437976 0.11635788 -0.08086892 0.05690086 -0.25297126 0.05694014
## 19668 8050 18380 3872 2944 295
## 0.05792198 0.13649415 0.19754956 0.08016384 -0.06402981 -0.05522868
## 14472 14242 20042 2714 12647 752
## -0.10780025 -0.08271626 0.06914056 -0.20548676 0.09352166 0.05727542
## 4181 14737 21352 4763 17845 15247
## -0.06258015 -0.06160901 0.12384079 0.11344981 0.07766217 0.09460210
## 9926 18761 13788 18528 240 5025
## 0.06181817 -0.06140221 0.06118910 -0.06251784 -0.08919034 -0.06539265
## 15023 1397 8830 18989 13966 10892
## 0.06803801 -0.06003608 -0.07737612 -0.06980278 -0.05644983 -0.09757097
## 1809 10264 760 17402 15633 17577
## 0.07733518 0.19057000 -0.08073097 -0.09198329 0.07977387 0.10234049
## 19324 19685 1883 159 2896 19782
## 0.06819736 0.08500270 0.07036221 -0.08069348 -0.08267836 -0.06946275
## 11402 4363 8164 15040 7416 18705
## 0.05898108 0.06744806 0.08470630 0.19437678 0.08842645 0.06036455
## 3528 13041 5774 6826 11872 9157
## -0.06010001 0.07213705 -0.05988558 -0.06740608 -0.09548012 -0.05768331
## 9557 8320 5381 17450 16571 3587
## 0.07612202 0.08972642 0.06099323 0.05648136 -0.05498992 -0.08332844
## 14648 7887 5571 19467 19962 15618
## 0.06076670 -0.07720355 -0.05617430 0.09563764 0.07903091 0.13755162
## 2383 5673 7098 13811 16016 8785
## -0.07091827 -0.08442465 -0.13177439 -0.06580293 0.05735440 -0.07271719
## 13967 6769 13826 7123 18024 7270
## -0.21007995 -0.05546427 -0.05994089 0.05666347 0.05640313 -0.05664152
## 14582 1200 351 3253 5590 13257
## -0.09437688 -0.06844834 0.07804958 -0.14603333 0.07617241 0.07484506
## 18070 14921 9323 6515 20536 270
## 0.07505360 0.08647714 0.05978756 0.06111724 0.09779741 0.10521380
## 20008 16039 4812 13629 10284 16715
## 0.07277246 -0.07018999 -0.08879640 -0.08706015 -0.06107481 -0.05746634
## 14840 7607 14856 18780 15869 3109
## -0.10241042 0.08654399 -0.05859315 -0.06341514 -0.07804742 -0.05699177
## 8915 19913 16185 12940 420 13663
## -0.08732471 0.06632595 0.11013388 -0.05626900 -0.07580941 -0.06525244
## 7121 20370 12933 6614 13400 800
## -0.09380302 0.07209740 0.08635548 -0.05533993 0.10794918 -0.05771530
## 18589 19087 3379 7136 4524 11741
## 0.09450220 0.05726069 -0.06376454 0.06704282 0.06685207 0.05732004
## 1449 8665 15011 17350 444 2430
## 0.12186388 0.07387338 -0.05571048 0.06983920 0.07733287 0.05669681
## 8890 2564 6524 2304 3976 8058
## -0.05909834 0.06717681 -0.05963004 -0.09045328 -0.05650660 0.05628600
## 12337 20423 1244 10648 8444 11279
## 0.05859208 0.05743791 -0.06525520 0.07298027 0.08636107 0.13728392
## 6533 1437 7959 3278 6378 8817
## -0.05740563 -0.06133080 -0.06976886 0.07068839 0.09847261 0.05952260
## 8856 15945 19455 18803 18200 1957
## 0.08657320 0.09209136 -0.06219256 -0.10145270 0.06869093 0.06047496
## 466 13154 10920 6784 5720 1932
## -0.09908183 0.06985926 0.05772840 0.06526192 -0.06446319 -0.08235971
## 2086 18656 8642 19098 18646 5450
## 0.05676922 0.05925669 -0.08508716 -0.10149944 0.08545311 0.09809084
## 1808 20297 21103 13606 9033 3778
## -0.07794957 -0.07737086 -0.07757591 0.13677372 -0.07564644 0.05722280
## 3040 15145 13673 6832 15238 4049
## 0.08309355 -0.06295635 0.06546142 -0.05598540 0.07460585 -0.07355320
## 20579 1735 15021 17307 17950 7531
## -0.07194348 -0.07100500 0.05887795 0.09257055 -0.13727144 0.07517333
## 3583 3862 5600 7834 16289 8478
## 0.07877095 0.06002156 -0.07312489 0.05900212 0.12097651 -0.05811937
## 17115 10964 19987 15483 17570 11333
## -0.06823092 -0.13255425 0.07622841 -0.09102489 0.06281269 0.05684365
## 12649 3440 18293 1386 4769 10771
## 0.11044821 0.06198071 -0.09749187 -0.05716780 -0.12571127 -0.05548057
## 2787 876 1850 4564 9173 2865
## 0.06150604 0.16182841 -0.07198668 -0.06535846 0.06076399 0.11058515
## 2798 498 18536 13965 14140 11106
## 0.05572731 -0.07741593 -0.06305540 0.06039245 0.08796435 -0.05777285
## 17408 20326 17768 8161 10470 10447
## 0.05592642 0.09141891 0.13726908 -0.05688934 0.05971804 0.09951650
## 8979 19418 7370 3952 10421 12510
## 0.07139133 -0.09288807 -0.11166059 -0.07450546 -0.06859208 0.10060053
## 13149 6692 5562 18006 12778 18512
## 0.08848013 0.27256779 -0.06141728 0.12487391 -0.72460199 0.08421228
## 20564 2047 16427 20253 3719 4582
## -0.06746613 0.07409474 -0.05691316 0.07615391 0.05520506 -0.06155510
## 21324 2041 19726 9970 17876 1573
## 0.06592488 0.06289881 0.07548353 0.06159988 0.05629513 0.09159554
## 13543 8386 21142 3259 12819 12714
## 0.05944797 0.05806847 0.09361514 0.07777520 0.06924573 0.08797078
## 5833 18849 2142 14551 5064 15377
## 0.13276578 -0.35866595 -0.07143585 0.14226512 -0.05587090 0.06468140
## 17903 892 8130 19673 18976 13072
## 0.06526040 0.09096267 -0.05980854 -0.06401717 -0.06468962 0.09935068
## 7320 8717 12232 14828 17810 11953
## 0.08369567 0.09243602 -0.06600442 -0.15894459 -0.06812206 0.08972838
## 9106 13488 19260 2474 12614 3092
## 0.06778837 -0.06133681 -0.09868033 0.07355447 0.11958108 0.05709639
## 19117 15775 13442 1731 5119 12906
## -0.07507886 -0.09870355 0.09714945 0.06245940 -0.05996527 0.06241194
## 16804 8388 16414 8223 7069 7253
## -0.07986587 0.09273750 -0.06946000 0.07765127 0.09035226 -0.19148554
## 13549 14783 1880 9715 18209 3185
## 0.07408928 -0.06249606 0.08381725 -0.07617292 0.12827603 -0.06599895
## 5139 2075 17252 7429 7251 17083
## 0.06488671 -0.09973704 -0.05540390 0.05660453 0.06052728 0.08680303
## 14900 7452 2307 16843 6013 19732
## -0.07452334 0.06002563 -0.09115086 -0.10731967 0.06700331 -0.11145337
## 5640 5776 2412 15871 5551 14048
## -0.05972987 0.06252856 -0.13864874 1.28527891 -0.07021696 -0.08980036
## 1755 10476 20145 8531 13261 20371
## -0.07089967 0.05753903 0.10968522 -0.08342586 0.05898900 0.10081856
## 19824 9722 17957 9302 8326 17572
## 0.08096301 0.07985825 0.06187136 0.05601557 -0.07096757 -0.12385910
## 10469 10428 1485 14255 7701 13295
## 0.10106618 0.09431438 0.05521356 0.11258296 -0.05646027 0.08761202
## 18629 18394 17531 5191 4541 15588
## 0.06481850 0.08673979 0.06913950 0.06145712 0.05906316 -0.07830251
## 12936 16855 4602 17772 3238 9555
## -0.12433825 0.07343956 0.05704021 -0.05808546 -0.06296861 -0.06012670
## 3264 3672 1161 21373 8475 14056
## 0.05673810 0.06050271 -0.07392745 -0.10941725 -0.05493448 0.06789239
## 1032 10843 15139 1264 1531 7786
## 0.08428998 0.07468501 -0.05821672 -0.05537226 -0.05827980 -0.06210923
## 7996 6080 21149 13526 12684 417
## -0.06751234 0.06842547 0.06490269 -0.12996782 0.06269976 0.06762985
## 10326 13608 3387 14853 17290 5968
## 0.05751493 -0.06866894 0.06807383 -0.06656989 0.07553078 -0.05842884
## 5030 3806 13908 6464 14216 10170
## -0.06593869 -0.08542474 -0.11040141 -0.05827068 0.05817783 -0.07371833
## 10828 4273 15416 682 4203 9275
## -0.05560873 0.05710634 0.17646129 0.07809119 0.07736733 -0.07218134
## 21358 11703 14572 18273 15378 11774
## 0.05530573 -0.10465324 -0.06404848 0.07908051 0.22437070 0.05616608
## 3403 11797 14557 19623 15032 1623
## -0.06702747 -0.05721405 -0.18730005 0.11858310 -0.14273870 -0.09195913
## 6393 15007 13771 2150 4340 6103
## -0.07502336 -0.09509251 -0.05954666 -0.07718390 0.10053608 -0.07772130
## 786 10868 19337 5762 10526 15955
## -0.05820930 0.05693276 0.05506844 -0.05589593 0.06949036 0.11308467
## 7847 14188 4610 2846 9074 15693
## -0.07913104 -0.12528360 -0.05860391 0.08752495 -0.11766781 -0.25862054
## 14902 14570 12980 15294 1927 9324
## 0.10451500 0.05572357 -0.05501150 -0.10136499 -0.05957692 0.08843227
## 12886 4913 7433 15516 15411 14367
## 0.07673982 -0.08827226 0.11927691 -0.05910905 0.11566163 -0.06974006
## 11603 5413 14798 1273 17708 5609
## 0.05649084 -0.06172797 0.06466677 0.09728207 -0.05488721 0.06985099
## 10070 1472 13117 5593 11709 8997
## -0.06877214 0.06395512 -0.06465741 0.07087921 0.05513587 0.05905700
## 5163 13199 1257 20501 11254 17102
## 0.06367470 0.05989496 -0.05669088 0.08786917 -0.06998760 0.05724316
## 11160 17860 11299 7541 19889 591
## -0.11092348 -0.12401331 -0.06124001 -0.05539894 0.05569711 0.06658016
## 12209 4388 6380 18965 21072 8278
## -0.08357803 -0.08164945 -0.06544323 0.06270183 0.05673104 -0.28920841
## 428 11730 10620 1938 9857 11634
## 0.07774010 -0.11938563 -0.06587556 0.06502878 -0.10297896 0.08466670
## 6155 18794 15634 14349 14148 19105
## -0.07909587 0.06805622 -0.05909267 -0.06069094 0.06322879 -0.15998128
## 11218 18305 6399 485 8345 20898
## 0.06241337 0.06394627 0.07013151 0.06053187 -0.12046489 0.05566937
## 7281 17931 16107 3802 17839 5026
## -0.06700666 0.09049822 0.07556182 0.07542499 0.06520629 0.06558673
## 16082 1283 3353 5434 13313 12135
## 0.05646463 0.12498455 0.06505477 0.08663073 -0.14970723 0.11597006
## 4409 2059 19092 16341 14323 9450
## 0.08903873 -0.06592801 -0.06876561 -0.06030106 0.06405057 0.07364000
## 12397 19530 14099 5321 4135 6907
## 0.05722777 0.08109214 0.11630555 -0.06332987 -0.06093333 -0.08514851
## 21569 5990 13917 19189 16779 9199
## 0.06172140 0.05845615 0.06268721 -0.19720352 -0.06672879 -0.11446217
## 20453 13559 13676 18228 15888 8832
## 0.26654896 -0.05569079 0.08544398 0.05900581 0.09314487 0.09014571
## 11891 15893 9795 8607 2688 14053
## 0.06917554 0.11098310 0.10220817 0.08426558 0.05635489 0.10657598
## 12482 12288 11955 17716 11933 6550
## -0.05979510 0.05756861 -0.05643967 0.12444524 -0.07415478 -0.06889916
## 6036 10374 3169 10983 18483 9778
## -0.05886720 -0.08136845 -0.05983283 0.06431959 0.07977601 -0.06220027
## 7314 18478 5865 16970 2882 3282
## 0.10374048 -0.08871755 0.06495482 0.07176723 -0.06846825 0.07337423
## 1585 5851 19523 19778 9851 15298
## 0.07436225 -0.07927791 -0.10897348 0.06621383 0.06468412 0.06366740
## 12019 8674 5618 20951 12058 11366
## 0.06954498 0.07774463 0.09332332 0.05867510 0.06375438 -0.06054659
## 14012 10860 13058 8538 17797 13622
## -0.05632986 -0.06141979 -0.06341953 0.06262987 -0.05869278 0.05965714
## 1627 19386 13628 10023 18555 4792
## -0.09529229 -0.08103094 -0.07020420 -0.07811609 0.07054976 0.07313937
## 9548 14985 15329 7993 6433 3688
## 0.09940457 -0.08067415 0.07834550 -0.09365170 -0.07091925 0.08574485
## 10980 12115 3464 4036 2409 3955
## 0.06749533 -0.09788558 0.08884797 0.19367404 0.08816436 -0.06680181
## 13942 17387 8537 11785 8446 11054
## -0.08900517 0.10213393 0.08802105 -0.07799622 0.10910560 -0.07087861
## 19985 18344 21051 19018 19462 8645
## 0.09831118 -0.07291265 -0.27518356 0.06615561 -0.11791368 0.05913601
## 8249 7912 15752 16996 18877 4938
## -0.07664036 0.11737763 -0.12766052 -0.07213014 0.15035517 0.08575936
## 9412 10192 7518 3322
## 0.06600336 -0.07604076 -0.07537319 -0.06403156
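DFFITS scales the externally studentized residual by a leverage factor, and the cutoff 2 * sqrt(p/n) shrinks as n grows, which helps explain why so many rows are flagged in a data set of this size. A minimal sketch on mtcars (an assumed stand-in, with hypothetical names t_ext, dffits_manual, flagged, and kept):

```r
# Illustrative only: mtcars stands in for the housing data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit))

h <- hatvalues(fit)
t_ext <- rstudent(fit)          # externally studentized residuals

# DFFITS_i = t_i * sqrt(h_i / (1 - h_i))
dffits_manual <- t_ext * sqrt(h / (1 - h))

# Size-adjusted cutoff, then the row names that exceed it
cutoff <- 2 * sqrt(p / n)
flagged <- names(dffits_manual)[abs(dffits_manual) > cutoff]

# Drop the flagged rows in one step by row name
kept <- mtcars[!(rownames(mtcars) %in% flagged), ]
```

Printing length(flagged) rather than the full vector keeps the output short when hundreds of rows are flagged.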
Now that we know the influential points, let’s remove them from our training data.
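indices <- rownames(data.frame(z))
train2 <- train
train2$indices <- rownames(train2)
# inside filter(), `indices` refers to the column, so keep the flagged ids under another name
flagged_ids <- indices
# drop every flagged row in a single vectorized filter
train2 <- train2 %>% filter(!(indices %in% flagged_ids))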
nrow(train2)
## [1] 16362
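After removing the influential points flagged by DFFITS, the in-sample fit of our model improves: R-squared rises from 0.6585 to 0.6727 and the predicted R-squared from 0.6579 to 0.6723. The test MSE, however, is essentially unchanged (about 4.41e10 versus 4.39e10 before).
Also, our new model meets all regression assumptions.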
result3 <- lm(price ~ bedrooms + bathrooms + floors + waterfront + view + condition + grade + yr_built + sqft_living +
sqft_living15 + sqft_lot15 + renovated, data = train2)
summary(result3)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront +
## view + condition + grade + yr_built + sqft_living + sqft_living15 +
## sqft_lot15 + renovated, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36488 -0.07307 0.00445 0.07407 0.37689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.918e+00 9.293e-03 314.005 < 2e-16 ***
## bedrooms -1.220e-02 1.199e-03 -10.169 < 2e-16 ***
## bathrooms 2.796e-02 1.647e-03 16.975 < 2e-16 ***
## floors 4.394e-02 1.872e-03 23.472 < 2e-16 ***
## waterfront 1.508e-01 1.470e-02 10.260 < 2e-16 ***
## view 1.966e-02 1.324e-03 14.847 < 2e-16 ***
## condition 1.784e-02 1.376e-03 12.969 < 2e-16 ***
## grade 7.446e-02 1.279e-03 58.220 < 2e-16 ***
## yr_built -4.079e-02 7.413e-04 -55.021 < 2e-16 ***
## sqft_living 5.526e-05 2.094e-06 26.394 < 2e-16 ***
## sqft_living15 3.414e-05 2.099e-06 16.264 < 2e-16 ***
## sqft_lot15 -2.928e-07 3.788e-08 -7.731 1.13e-14 ***
## renovated 1.110e-02 4.850e-03 2.289 0.0221 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1033 on 16349 degrees of freedom
## Multiple R-squared: 0.6727, Adjusted R-squared: 0.6725
## F-statistic: 2801 on 12 and 16349 DF, p-value: < 2.2e-16
test$predict <- round(predict(result3, newdata = test) ^ 10)
test_mse_ln_3 <- mean((test$price - test$predict)^2)
test_mse_ln_3
## [1] 44104007128
summary(result3)$r.squared
## [1] 0.6727283
summary(result3)$adj.r.squared
## [1] 0.6724881
PRESS(result3)
## [1] 174.7996
##Find SST
anova_result<-anova(result3)
SST <- sum(anova_result$"Sum Sq")
##R2 pred
Rsq_pred <- 1-PRESS(result3)/SST
Rsq_pred
## [1] 0.6723251
yhat<-result3$fitted.values
res<-result3$residuals
Data<-data.frame(train2,yhat,res)
ggplot(Data, aes(x=yhat,y=res))+
geom_point()+
geom_hline(yintercept=0, color="red")+
labs(x="Fitted y",
y="Residuals",
title="Residual Plot")
acf(res)
qqnorm(res)
qqline(res, col="red")
vif(result3)
## bedrooms bathrooms floors waterfront view
## 1.683463 2.093926 1.560657 1.109195 1.231753
## condition grade yr_built sqft_living sqft_living15
## 1.228025 3.110906 1.808906 4.792639 2.929764
## sqft_lot15 renovated
## 1.074602 1.100928
To deal with the negative coefficient of the sqft_lot15 predictor, let’s apply a log transformation to sqft_lot15.
ggplot(train2, aes(x = sqft_lot15, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Sqft Lot15", y = "Price", title = "A Scatterplot of Sqft Lot 15 vs Price")
## `geom_smooth()` using formula 'y ~ x'
ggplot(train2, aes(x = log(sqft_lot15), y = price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "log(Sqft Lot15)", y = "Price", title = "A Scatterplot of log(Sqft Lot 15) vs Price")
## `geom_smooth()` using formula 'y ~ x'
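Even after the log transformation of the sqft_lot15 predictor, its coefficient is still negative. However, the overall performance of our model is slightly better than that of the previous one: R-squared rises from 0.6727 to 0.6816 and the test MSE falls from about 4.41e10 to about 4.34e10.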
result4 <- lm(price ~ bedrooms + bathrooms + floors + waterfront + view + condition + grade + yr_built + sqft_living +
sqft_living15 + log(sqft_lot15) + renovated, data = train2)
summary(result4)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + floors + waterfront +
## view + condition + grade + yr_built + sqft_living + sqft_living15 +
## log(sqft_lot15) + renovated, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35417 -0.07130 0.00412 0.07211 0.38181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.160e+00 1.418e-02 222.882 < 2e-16 ***
## bedrooms -1.155e-02 1.178e-03 -9.799 < 2e-16 ***
## bathrooms 2.370e-02 1.637e-03 14.478 < 2e-16 ***
## floors 2.861e-02 1.980e-03 14.448 < 2e-16 ***
## waterfront 1.684e-01 1.452e-02 11.599 < 2e-16 ***
## view 1.838e-02 1.307e-03 14.065 < 2e-16 ***
## condition 1.931e-02 1.359e-03 14.214 < 2e-16 ***
## grade 7.324e-02 1.262e-03 58.029 < 2e-16 ***
## yr_built -3.961e-02 7.329e-04 -54.039 < 2e-16 ***
## sqft_living 6.243e-05 2.089e-06 29.882 < 2e-16 ***
## sqft_living15 4.404e-05 2.122e-06 20.754 < 2e-16 ***
## log(sqft_lot15) -2.787e-02 1.228e-03 -22.698 < 2e-16 ***
## renovated 1.568e-02 4.788e-03 3.275 0.00106 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1019 on 16349 degrees of freedom
## Multiple R-squared: 0.6816, Adjusted R-squared: 0.6813
## F-statistic: 2916 on 12 and 16349 DF, p-value: < 2.2e-16
test$predict <- round(predict(result4, newdata = test) ^ 10)
test_mse_ln_4 <- mean((test$price - test$predict)^2)
test_mse_ln_4
## [1] 43387370710
summary(result4)$r.squared
## [1] 0.6815662
summary(result4)$adj.r.squared
## [1] 0.6813325
PRESS(result4)
## [1] 170.087
##Find SST
anova_result<-anova(result4)
SST <- sum(anova_result$"Sum Sq")
##R2 pred
Rsq_pred <- 1-PRESS(result4)/SST
Rsq_pred
## [1] 0.6811591
yhat<-result4$fitted.values
res<-result4$residuals
Data<-data.frame(train2,yhat,res)
ggplot(Data, aes(x=yhat,y=res))+
geom_point()+
geom_hline(yintercept=0, color="red")+
labs(x="Fitted y",
y="Residuals",
title="Residual Plot")
acf(res)
qqnorm(res)
qqline(res, col="red")
vif(result4)
## bedrooms bathrooms floors waterfront view
## 1.670024 2.125405 1.794334 1.112768 1.234111
## condition grade yr_built sqft_living sqft_living15
## 1.231155 3.113223 1.817342 4.904594 3.076865
## log(sqft_lot15) renovated
## 1.462865 1.102725
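To read the coefficient on log(sqft_lot15), note that doubling the neighboring lot size adds log(2) to the predictor. The short calculation below uses the estimate from summary(result4); the baseline fitted value of 3.7 is a hypothetical choice for illustration, not a number from the data.

```r
beta_log_lot <- -2.787e-02      # coefficient on log(sqft_lot15) from summary(result4)

# Doubling the lot size adds log(2) to the predictor, shifting y* = price^0.1 by:
delta_ystar <- beta_log_lot * log(2)

# Hypothetical baseline: a home whose fitted y* is 3.7 (a price of 3.7^10 dollars)
base_ystar <- 3.7
pct_change <- 100 * (((base_ystar + delta_ystar) / base_ystar)^10 - 1)
```

Because the response is price^0.1, the same shift in y* translates into a different percentage change in price at a different baseline.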
To boost the predictive performance of our model, let’s take full advantage of the zipcode predictor that we dropped at the beginning. Our dataset contains a total of 70 distinct zip codes in King County.
house2 <- read.csv('kc_house_data.csv')
house2 <- house2 %>% dplyr::select(-id, -date, -lat, -long)
zipcode <- unique(house2$zipcode)
zipcode
## [1] 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 98007 98115
## [13] 98107 98126 98019 98103 98002 98133 98040 98092 98030 98119 98112 98052
## [25] 98027 98117 98058 98001 98056 98166 98023 98070 98148 98105 98042 98008
## [37] 98059 98122 98144 98004 98005 98034 98075 98116 98010 98118 98199 98032
## [49] 98045 98102 98077 98108 98168 98177 98065 98029 98006 98109 98022 98033
## [61] 98155 98024 98011 98031 98106 98072 98188 98014 98055 98039
From www.niche.com, we can obtain the King County zip codes with an overall grade above A-. We then create a new categorical variable that equals 1 if the zip code is in the list and 0 otherwise. Judging by the histogram below, the two categories are fairly well balanced.
# https://www.niche.com/places-to-live/search/best-zip-codes-to-live/c/king-county-wa/
# Overall Grade > A-
good_zip <- c(98004, 98005, 98052, 98121, 98007, 98109, 98033, 98122, 98029, 98006, 98103, 98102, 98074, 98101, 98040, 98115, 98112, 98107, 98119, 98105, 98075, 98008, 98116, 98053, 98034, 98039, 98144, 98199, 98117, 98104, 98028, 98027, 98011, 98177, 98125, 98065, 98072, 98077, 98126, 98155, 98136, 98059, 98133, 98188, 98106)
house2$good_neigh <- ifelse(house2$zipcode %in% good_zip, 1, 0)
hist(house2$good_neigh)
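As a numeric complement to the histogram, the class balance can be checked with a proportion table, and setdiff() shows which listed zip codes never occur in the data (for instance, 98101, 98104, and 98121 are in good_zip but not among the 70 zip codes printed above). The sketch below uses a small hypothetical sample of zip codes and a shortened list, not the full data.

```r
# Illustrative only: a small hypothetical sample of zip codes and a shortened list.
zips <- c(98178, 98125, 98028, 98136, 98074, 98003)
good_zip <- c(98125, 98028, 98136, 98074, 98121)

# 1 if the zip code is in the list, 0 otherwise
good_neigh <- as.integer(zips %in% good_zip)

# Class balance as proportions
balance <- prop.table(table(good_neigh))

# Listed zip codes that never occur in the data
unused <- setdiff(good_zip, zips)
```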
# Recode yr_built into 20-year bins: 0 = 1900-1919, ..., 5 = 2000 or later
house2$yr_built <- case_when(
(1900 <= house2$yr_built) & (house2$yr_built < 1920) ~ 0,
(1920 <= house2$yr_built) & (house2$yr_built < 1940) ~ 1,
(1940 <= house2$yr_built) & (house2$yr_built < 1960) ~ 2,
(1960 <= house2$yr_built) & (house2$yr_built < 1980) ~ 3,
(1980 <= house2$yr_built) & (house2$yr_built < 2000) ~ 4,
(2000 <= house2$yr_built) ~ 5)
# Flag homes that have been renovated (yr_renovated is 0 when there was no renovation)
house2$renovated <- ifelse(house2$yr_renovated != 0, 1, 0)
# zipcode and yr_renovated are now represented by good_neigh and renovated
house2 <- house2 %>% dplyr::select(-zipcode, -yr_renovated)
head(house2)
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1 221900 3 1.00 1180 5650 1 0 0
## 2 538000 3 2.25 2570 7242 2 0 0
## 3 180000 2 1.00 770 10000 1 0 0
## 4 604000 4 3.00 1960 5000 1 0 0
## 5 510000 3 2.00 1680 8080 1 0 0
## 6 1225000 4 4.50 5420 101930 1 0 0
## condition grade sqft_above sqft_basement yr_built sqft_living15 sqft_lot15
## 1 3 7 1180 0 2 1340 5650
## 2 3 7 2170 400 2 1690 7639
## 3 3 6 770 0 1 2720 8062
## 4 5 7 1050 910 3 1360 5000
## 5 3 8 1680 0 4 1800 7503
## 6 3 11 3890 1530 5 4760 101930
## good_neigh renovated
## 1 0 0
## 2 1 1
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
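The 20-year binning of yr_built can also be written more compactly. The sketch below is an alternative under the assumption that every build year is 1900 or later; it uses base R's `findInterval()` on the build years of the first six homes, and the names `breaks`, `yr`, and `yr_bin` are illustrative.

```r
# Lower edges of the 20-year bins used in the case_when() call
breaks <- seq(1900, 2000, by = 20)

# Build years of the first six homes in the raw data
yr <- c(1955, 1951, 1933, 1965, 1987, 2001)

# findInterval() returns 1..6 for years >= 1900; subtracting 1 gives codes 0..5
yr_bin <- findInterval(yr, breaks) - 1
yr_bin
```

One difference to keep in mind: a year before 1900 would get the code -1 here, whereas the `case_when()` version would return NA.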
set.seed(1) ##for reproducibility to get the same split
sample <- sample.int(nrow(house2), floor(.80*nrow(house2)), replace = FALSE) ##index the rows of house2, the data frame being split
train2 <- house2[sample, ] ##training data frame
test2 <- house2[-sample, ] ##test data frame
head(train2)
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 17401 550000 3 1.75 2910 35200 1.5 0 0
## 4775 275000 4 2.50 2120 6754 2.0 0 0
## 13218 455000 5 2.00 1510 3000 2.0 0 0
## 10539 384950 3 2.50 1860 3690 2.0 0 0
## 8462 140000 2 1.00 900 6400 1.0 0 0
## 4050 925000 3 2.50 2690 7000 2.0 0 0
## condition grade sqft_above sqft_basement yr_built sqft_living15
## 17401 3 8 2910 0 3 2590
## 4775 3 7 2120 0 4 2120
## 13218 3 6 1510 0 4 1610
## 10539 3 7 1860 0 5 1870
## 8462 2 6 900 0 2 1350
## 4050 5 7 1840 850 2 1800
## sqft_lot15 good_neigh renovated
## 17401 37500 1 0
## 4775 6937 0 0
## 13218 3600 1 0
## 10539 4394 1 0
## 8462 6405 0 0
## 4050 6435 1 0
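A simple sanity check on an 80/20 split is that the two parts are disjoint and together cover every row. Here is a minimal sketch on a toy data frame; `toy`, `train_idx`, `toy_train`, and `toy_test` are hypothetical names standing in for house2 and its split.

```r
set.seed(1)

# Toy data frame standing in for house2
toy <- data.frame(id = 1:10, price = seq(100, 1000, by = 100))

# Draw 80% of the row indices without replacement
train_idx <- sample.int(nrow(toy), floor(0.80 * nrow(toy)), replace = FALSE)
toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# Sizes of the two parts and the number of rows they share
nrow(toy_train)
nrow(toy_test)
length(intersect(toy_train$id, toy_test$id))
```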
As we did at the beginning, let’s use an automated search procedure to select predictors.
regnull <- lm(price ~ 1, data = train2)
regfull <- lm(price ~ ., data = train2)
step(regnull, scope = list(lower = regnull, upper = regfull), direction = "both")
## Start: AIC=442775.7
## price ~ 1
##
## Df Sum of Sq RSS AIC
## + sqft_living 1 1.1329e+15 1.1553e+15 430962
## + grade 1 1.0242e+15 1.2640e+15 432516
## + sqft_above 1 8.4127e+14 1.4469e+15 434853
## + sqft_living15 1 7.8905e+14 1.4992e+15 435466
## + bathrooms 1 6.2379e+14 1.6644e+15 437274
## + good_neigh 1 3.7528e+14 1.9129e+15 439680
## + view 1 3.5639e+14 1.9318e+15 439850
## + sqft_basement 1 2.3919e+14 2.0490e+15 440869
## + bedrooms 1 2.0920e+14 2.0790e+15 441120
## + waterfront 1 1.7097e+14 2.1172e+15 441435
## + floors 1 1.6137e+14 2.1268e+15 441513
## + renovated 1 3.5656e+13 2.2525e+15 442506
## + sqft_lot 1 2.0037e+13 2.2682e+15 442626
## + sqft_lot15 1 1.5906e+13 2.2723e+15 442657
## + yr_built 1 5.6435e+12 2.2826e+15 442735
## + condition 1 3.4753e+12 2.2847e+15 442751
## <none> 2.2882e+15 442776
##
## Step: AIC=430962
## price ~ sqft_living
##
## Df Sum of Sq RSS AIC
## + good_neigh 1 2.2094e+14 9.3438e+14 427294
## + view 1 9.6282e+13 1.0590e+15 429459
## + grade 1 9.6101e+13 1.0592e+15 429462
## + waterfront 1 9.0018e+13 1.0653e+15 429561
## + yr_built 1 6.8189e+13 1.0871e+15 429912
## + bedrooms 1 3.3062e+13 1.1223e+15 430462
## + renovated 1 1.6775e+13 1.1386e+15 430711
## + sqft_living15 1 1.6529e+13 1.1388e+15 430715
## + condition 1 1.3494e+13 1.1418e+15 430761
## + sqft_lot15 1 6.0106e+12 1.1493e+15 430874
## + sqft_lot 1 3.2768e+12 1.1520e+15 430915
## + sqft_above 1 1.1799e+12 1.1541e+15 430946
## + sqft_basement 1 1.1799e+12 1.1541e+15 430946
## + floors 1 3.1999e+11 1.1550e+15 430959
## + bathrooms 1 2.4923e+11 1.1551e+15 430960
## <none> 1.1553e+15 430962
## - sqft_living 1 1.1329e+15 2.2882e+15 442776
##
## Step: AIC=427294.2
## price ~ sqft_living + good_neigh
##
## Df Sum of Sq RSS AIC
## + waterfront 1 1.0049e+14 8.3390e+14 425329
## + view 1 9.3769e+13 8.4061e+14 425468
## + grade 1 4.8715e+13 8.8567e+14 426370
## + yr_built 1 4.1487e+13 8.9289e+14 426511
## + bedrooms 1 2.6106e+13 9.0828e+14 426806
## + renovated 1 1.4132e+13 9.2025e+14 427033
## + condition 1 1.1442e+13 9.2294e+14 427083
## + sqft_living15 1 8.4179e+12 9.2596e+14 427140
## + bathrooms 1 7.2292e+11 9.3366e+14 427283
## + floors 1 5.1669e+11 9.3386e+14 427287
## + sqft_lot 1 1.6274e+11 9.3422e+14 427293
## <none> 9.3438e+14 427294
## + sqft_lot15 1 9.4700e+09 9.3437e+14 427296
## + sqft_above 1 1.7768e+07 9.3438e+14 427296
## + sqft_basement 1 1.7768e+07 9.3438e+14 427296
## - good_neigh 1 2.2094e+14 1.1553e+15 430962
## - sqft_living 1 9.7854e+14 1.9129e+15 439680
##
## Step: AIC=425329
## price ~ sqft_living + good_neigh + waterfront
##
## Df Sum of Sq RSS AIC
## + grade 1 4.6444e+13 7.8745e+14 424340
## + view 1 3.9679e+13 7.9422e+14 424488
## + yr_built 1 3.3563e+13 8.0033e+14 424621
## + bedrooms 1 1.7950e+13 8.1595e+14 424955
## + condition 1 1.0031e+13 8.2386e+14 425122
## + renovated 1 8.3178e+12 8.2558e+14 425158
## + sqft_living15 1 7.8908e+12 8.2600e+14 425167
## + floors 1 3.9438e+11 8.3350e+14 425323
## + bathrooms 1 3.7414e+11 8.3352e+14 425323
## + sqft_lot 1 2.4130e+11 8.3365e+14 425326
## + sqft_above 1 2.3809e+11 8.3366e+14 425326
## + sqft_basement 1 2.3809e+11 8.3366e+14 425326
## <none> 8.3390e+14 425329
## + sqft_lot15 1 3.0893e+10 8.3386e+14 425330
## - waterfront 1 1.0049e+14 9.3438e+14 427294
## - good_neigh 1 2.3141e+14 1.0653e+15 429561
## - sqft_living 1 8.9762e+14 1.7315e+15 437960
##
## Step: AIC=424340.1
## price ~ sqft_living + good_neigh + waterfront + grade
##
## Df Sum of Sq RSS AIC
## + yr_built 1 7.8000e+13 7.0945e+14 422539
## + view 1 3.4827e+13 7.5262e+14 423560
## + condition 1 1.8697e+13 7.6875e+14 423927
## + bedrooms 1 1.0676e+13 7.7678e+14 424106
## + renovated 1 1.0488e+13 7.7696e+14 424110
## + floors 1 7.7556e+12 7.7970e+14 424171
## + bathrooms 1 4.5855e+12 7.8287e+14 424241
## + sqft_above 1 2.8014e+12 7.8465e+14 424281
## + sqft_basement 1 2.8014e+12 7.8465e+14 424281
## + sqft_living15 1 4.2468e+11 7.8703e+14 424333
## + sqft_lot 1 2.9435e+11 7.8716e+14 424336
## <none> 7.8745e+14 424340
## + sqft_lot15 1 1.4095e+10 7.8744e+14 424342
## - grade 1 4.6444e+13 8.3390e+14 425329
## - waterfront 1 9.8214e+13 8.8567e+14 426370
## - good_neigh 1 1.8343e+14 9.7088e+14 427959
## - sqft_living 1 2.0959e+14 9.9704e+14 428418
##
## Step: AIC=422538.6
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built
##
## Df Sum of Sq RSS AIC
## + view 1 1.9437e+13 6.9001e+14 422060
## + bedrooms 1 1.0695e+13 6.9876e+14 422278
## + condition 1 2.3116e+12 7.0714e+14 422484
## + bathrooms 1 1.9274e+12 7.0752e+14 422494
## + renovated 1 1.3410e+12 7.0811e+14 422508
## + sqft_living15 1 9.3984e+11 7.0851e+14 422518
## + floors 1 3.5352e+11 7.0910e+14 422532
## + sqft_lot 1 1.4158e+11 7.0931e+14 422537
## + sqft_above 1 1.1899e+11 7.0933e+14 422538
## + sqft_basement 1 1.1899e+11 7.0933e+14 422538
## <none> 7.0945e+14 422539
## + sqft_lot15 1 2.3719e+10 7.0943e+14 422540
## - yr_built 1 7.8000e+13 7.8745e+14 424340
## - waterfront 1 8.4681e+13 7.9413e+14 424486
## - grade 1 9.0881e+13 8.0033e+14 424621
## - good_neigh 1 1.3021e+14 8.3966e+14 425450
## - sqft_living 1 1.9636e+14 9.0581e+14 426761
##
## Step: AIC=422060.3
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view
##
## Df Sum of Sq RSS AIC
## + bedrooms 1 8.5389e+12 6.8148e+14 421847
## + condition 1 2.1251e+12 6.8789e+14 422009
## + bathrooms 1 1.7060e+12 6.8831e+14 422020
## + sqft_above 1 1.2459e+12 6.8877e+14 422031
## + sqft_basement 1 1.2459e+12 6.8877e+14 422031
## + renovated 1 1.0939e+12 6.8892e+14 422035
## + floors 1 5.9382e+11 6.8942e+14 422047
## + sqft_living15 1 2.5814e+11 6.8976e+14 422056
## <none> 6.9001e+14 422060
## + sqft_lot 1 7.8955e+10 6.8994e+14 422060
## + sqft_lot15 1 5.4919e+10 6.8996e+14 422061
## - view 1 1.9437e+13 7.0945e+14 422539
## - waterfront 1 4.6785e+13 7.3680e+14 423193
## - yr_built 1 6.2610e+13 7.5262e+14 423560
## - grade 1 7.9779e+13 7.6979e+14 423950
## - good_neigh 1 1.3270e+14 8.2272e+14 425100
## - sqft_living 1 1.7908e+14 8.6910e+14 426048
##
## Step: AIC=421847
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms
##
## Df Sum of Sq RSS AIC
## + bathrooms 1 3.8353e+12 6.7764e+14 421751
## + condition 1 2.7023e+12 6.7877e+14 421780
## + renovated 1 1.0256e+12 6.8045e+14 421823
## + sqft_above 1 9.6664e+11 6.8051e+14 421824
## + sqft_basement 1 9.6664e+11 6.8051e+14 421824
## + floors 1 6.3527e+11 6.8084e+14 421833
## + sqft_lot15 1 3.0405e+11 6.8117e+14 421841
## + sqft_living15 1 1.8797e+11 6.8129e+14 421844
## <none> 6.8148e+14 421847
## + sqft_lot 1 1.7386e+08 6.8148e+14 421849
## - bedrooms 1 8.5389e+12 6.9001e+14 422060
## - view 1 1.7281e+13 6.9876e+14 422278
## - waterfront 1 4.4688e+13 7.2616e+14 422943
## - yr_built 1 6.3278e+13 7.4475e+14 423380
## - grade 1 7.2012e+13 7.5349e+14 423582
## - good_neigh 1 1.3142e+14 8.1290e+14 424894
## - sqft_living 1 1.6973e+14 8.5120e+14 425690
##
## Step: AIC=421751.4
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms
##
## Df Sum of Sq RSS AIC
## + condition 1 2.5458e+12 6.7509e+14 421688
## + sqft_above 1 1.2960e+12 6.7634e+14 421720
## + sqft_basement 1 1.2960e+12 6.7634e+14 421720
## + renovated 1 5.6770e+11 6.7707e+14 421739
## + sqft_living15 1 4.1997e+11 6.7722e+14 421743
## + sqft_lot15 1 1.8687e+11 6.7745e+14 421749
## + floors 1 1.3107e+11 6.7751e+14 421750
## <none> 6.7764e+14 421751
## + sqft_lot 1 5.1651e+09 6.7764e+14 421753
## - bathrooms 1 3.8353e+12 6.8148e+14 421847
## - bedrooms 1 1.0668e+13 6.8831e+14 422020
## - view 1 1.6672e+13 6.9431e+14 422170
## - waterfront 1 4.4549e+13 7.2219e+14 422850
## - yr_built 1 6.5320e+13 7.4296e+14 423341
## - grade 1 6.8072e+13 7.4571e+14 423404
## - sqft_living 1 1.2077e+14 7.9841e+14 424585
## - good_neigh 1 1.2765e+14 8.0529e+14 424734
##
## Step: AIC=421688.3
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition
##
## Df Sum of Sq RSS AIC
## + sqft_above 1 1.7795e+12 6.7332e+14 421645
## + sqft_basement 1 1.7795e+12 6.7332e+14 421645
## + renovated 1 1.0676e+12 6.7403e+14 421663
## + sqft_living15 1 4.5095e+11 6.7464e+14 421679
## + floors 1 3.2529e+11 6.7477e+14 421682
## + sqft_lot15 1 2.0826e+11 6.7489e+14 421685
## <none> 6.7509e+14 421688
## + sqft_lot 1 4.5832e+09 6.7509e+14 421690
## - condition 1 2.5458e+12 6.7764e+14 421751
## - bathrooms 1 3.6788e+12 6.7877e+14 421780
## - bedrooms 1 1.1226e+13 6.8632e+14 421971
## - view 1 1.6427e+13 6.9152e+14 422102
## - waterfront 1 4.4613e+13 7.1971e+14 422793
## - yr_built 1 5.1992e+13 7.2709e+14 422969
## - grade 1 6.8864e+13 7.4396e+14 423366
## - sqft_living 1 1.2018e+14 7.9528e+14 424519
## - good_neigh 1 1.2824e+14 8.0334e+14 424693
##
## Step: AIC=421644.7
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition + sqft_above
##
## Df Sum of Sq RSS AIC
## + renovated 1 9.9938e+11 6.7232e+14 421621
## + sqft_lot15 1 3.0093e+11 6.7301e+14 421639
## + sqft_living15 1 2.2173e+11 6.7309e+14 421641
## <none> 6.7332e+14 421645
## + floors 1 1.2372e+09 6.7331e+14 421647
## + sqft_lot 1 6.7932e+08 6.7331e+14 421647
## - sqft_above 1 1.7795e+12 6.7509e+14 421688
## - condition 1 3.0293e+12 6.7634e+14 421720
## - bathrooms 1 4.0513e+12 6.7737e+14 421746
## - bedrooms 1 1.1021e+13 6.8434e+14 421923
## - view 1 1.7822e+13 6.9114e+14 422094
## - waterfront 1 4.4101e+13 7.1742e+14 422740
## - sqft_living 1 5.2096e+13 7.2541e+14 422931
## - yr_built 1 5.3763e+13 7.2708e+14 422971
## - grade 1 6.0022e+13 7.3334e+14 423119
## - good_neigh 1 1.2997e+14 8.0329e+14 424694
##
## Step: AIC=421621
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition + sqft_above + renovated
##
## Df Sum of Sq RSS AIC
## + sqft_lot15 1 3.1404e+11 6.7200e+14 421615
## + sqft_living15 1 2.7048e+11 6.7205e+14 421616
## <none> 6.7232e+14 421621
## + sqft_lot 1 9.6491e+08 6.7231e+14 421623
## + floors 1 1.2630e+08 6.7232e+14 421623
## - renovated 1 9.9938e+11 6.7332e+14 421645
## - sqft_above 1 1.7112e+12 6.7403e+14 421663
## - bathrooms 1 3.4175e+12 6.7573e+14 421707
## - condition 1 3.5334e+12 6.7585e+14 421710
## - bedrooms 1 1.0811e+13 6.8313e+14 421895
## - view 1 1.7581e+13 6.8990e+14 422065
## - waterfront 1 4.3210e+13 7.1553e+14 422696
## - yr_built 1 4.4719e+13 7.1703e+14 422732
## - sqft_living 1 5.2405e+13 7.2472e+14 422917
## - grade 1 5.9879e+13 7.3219e+14 423094
## - good_neigh 1 1.3046e+14 8.0278e+14 424685
##
## Step: AIC=421615
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition + sqft_above + renovated +
## sqft_lot15
##
## Df Sum of Sq RSS AIC
## + sqft_living15 1 3.0923e+11 6.7169e+14 421609
## + sqft_lot 1 2.9174e+11 6.7171e+14 421609
## <none> 6.7200e+14 421615
## + floors 1 4.0939e+09 6.7200e+14 421617
## - sqft_lot15 1 3.1404e+11 6.7232e+14 421621
## - renovated 1 1.0125e+12 6.7301e+14 421639
## - sqft_above 1 1.8041e+12 6.7381e+14 421659
## - bathrooms 1 3.2882e+12 6.7529e+14 421697
## - condition 1 3.5833e+12 6.7558e+14 421705
## - bedrooms 1 1.1059e+13 6.8306e+14 421895
## - view 1 1.7678e+13 6.8968e+14 422062
## - waterfront 1 4.3153e+13 7.1515e+14 422689
## - yr_built 1 4.4582e+13 7.1658e+14 422724
## - sqft_living 1 5.2705e+13 7.2471e+14 422918
## - grade 1 5.9546e+13 7.3155e+14 423081
## - good_neigh 1 1.2606e+14 7.9806e+14 424586
##
## Step: AIC=421609
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition + sqft_above + renovated +
## sqft_lot15 + sqft_living15
##
## Df Sum of Sq RSS AIC
## + sqft_lot 1 3.1649e+11 6.7138e+14 421603
## <none> 6.7169e+14 421609
## + floors 1 8.0363e+08 6.7169e+14 421611
## - sqft_living15 1 3.0923e+11 6.7200e+14 421615
## - sqft_lot15 1 3.5279e+11 6.7205e+14 421616
## - renovated 1 1.0660e+12 6.7276e+14 421634
## - sqft_above 1 1.5454e+12 6.7324e+14 421647
## - bathrooms 1 3.4379e+12 6.7513e+14 421695
## - condition 1 3.5957e+12 6.7529e+14 421699
## - bedrooms 1 1.1081e+13 6.8277e+14 421890
## - view 1 1.6748e+13 6.8844e+14 422033
## - waterfront 1 4.3361e+13 7.1505e+14 422689
## - yr_built 1 4.4802e+13 7.1649e+14 422723
## - sqft_living 1 4.8255e+13 7.1995e+14 422807
## - grade 1 5.3809e+13 7.2550e+14 422939
## - good_neigh 1 1.2502e+14 7.9672e+14 424558
##
## Step: AIC=421602.8
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition + sqft_above + renovated +
## sqft_lot15 + sqft_living15 + sqft_lot
##
## Df Sum of Sq RSS AIC
## <none> 6.7138e+14 421603
## + floors 1 2.4246e+09 6.7137e+14 421605
## - sqft_lot 1 3.1649e+11 6.7169e+14 421609
## - sqft_living15 1 3.3398e+11 6.7171e+14 421609
## - sqft_lot15 1 6.6766e+11 6.7204e+14 421618
## - renovated 1 1.0742e+12 6.7245e+14 421628
## - sqft_above 1 1.4933e+12 6.7287e+14 421639
## - bathrooms 1 3.4356e+12 6.7481e+14 421689
## - condition 1 3.6152e+12 6.7499e+14 421694
## - bedrooms 1 1.0953e+13 6.8233e+14 421881
## - view 1 1.6649e+13 6.8802e+14 422024
## - waterfront 1 4.3541e+13 7.1492e+14 422687
## - yr_built 1 4.4574e+13 7.1595e+14 422712
## - sqft_living 1 4.7994e+13 7.1937e+14 422795
## - grade 1 5.3784e+13 7.2516e+14 422933
## - good_neigh 1 1.2534e+14 7.9671e+14 424560
##
## Call:
## lm(formula = price ~ sqft_living + good_neigh + waterfront +
## grade + yr_built + view + bedrooms + bathrooms + condition +
## sqft_above + renovated + sqft_lot15 + sqft_living15 + sqft_lot,
## data = train2)
##
## Coefficients:
## (Intercept) sqft_living good_neigh waterfront grade
## -5.535e+05 1.603e+02 1.908e+05 6.392e+05 8.721e+04
## yr_built view bedrooms bathrooms condition
## -4.823e+04 4.781e+04 -3.451e+04 3.222e+04 2.423e+04
## sqft_above renovated sqft_lot15 sqft_living15 sqft_lot
## 2.638e+01 4.198e+04 -3.387e-01 1.067e+01 1.643e-01
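The AIC values that `step()` prints for a linear model follow \(n \log(\mathrm{RSS}/n) + 2p\), where \(p\) counts the estimated coefficients. With \(n = 17290\) training rows, the intercept-only RSS of 2.2882e+15 and \(p = 1\), this gives approximately 442776, in line with the starting AIC above. The sketch below checks the formula against `extractAIC()` on the built-in mtcars data, which is used here only to keep the example self-contained.

```r
# Built-in data set, used only so the sketch runs on its own
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
p <- length(coef(fit))  # estimated coefficients, intercept included

# AIC as computed by hand versus the value step() relies on
aic_by_hand <- n * log(rss / n) + 2 * p
aic_step <- extractAIC(fit)[2]

c(by_hand = aic_by_hand, extractAIC = aic_step)
```

Because the same constant is dropped for every candidate model, only differences between these AIC values matter for the search.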
step(regnull, scope=list(lower=regnull, upper=regfull), direction="forward")
## Start: AIC=442775.7
## price ~ 1
##
## Df Sum of Sq RSS AIC
## + sqft_living 1 1.1329e+15 1.1553e+15 430962
## + grade 1 1.0242e+15 1.2640e+15 432516
## + sqft_above 1 8.4127e+14 1.4469e+15 434853
## + sqft_living15 1 7.8905e+14 1.4992e+15 435466
## + bathrooms 1 6.2379e+14 1.6644e+15 437274
## + good_neigh 1 3.7528e+14 1.9129e+15 439680
## + view 1 3.5639e+14 1.9318e+15 439850
## + sqft_basement 1 2.3919e+14 2.0490e+15 440869
## + bedrooms 1 2.0920e+14 2.0790e+15 441120
## + waterfront 1 1.7097e+14 2.1172e+15 441435
## + floors 1 1.6137e+14 2.1268e+15 441513
## + renovated 1 3.5656e+13 2.2525e+15 442506
## + sqft_lot 1 2.0037e+13 2.2682e+15 442626
## + sqft_lot15 1 1.5906e+13 2.2723e+15 442657
## + yr_built 1 5.6435e+12 2.2826e+15 442735
## + condition 1 3.4753e+12 2.2847e+15 442751
## <none> 2.2882e+15 442776
##
## Step: AIC=430962
## price ~ sqft_living
##
## Df Sum of Sq RSS AIC
## + good_neigh 1 2.2094e+14 9.3438e+14 427294
## + view 1 9.6282e+13 1.0590e+15 429459
## + grade 1 9.6101e+13 1.0592e+15 429462
## + waterfront 1 9.0018e+13 1.0653e+15 429561
## + yr_built 1 6.8189e+13 1.0871e+15 429912
## + bedrooms 1 3.3062e+13 1.1223e+15 430462
## + renovated 1 1.6775e+13 1.1386e+15 430711
## + sqft_living15 1 1.6529e+13 1.1388e+15 430715
## + condition 1 1.3494e+13 1.1418e+15 430761
## + sqft_lot15 1 6.0106e+12 1.1493e+15 430874
## + sqft_lot 1 3.2768e+12 1.1520e+15 430915
## + sqft_above 1 1.1799e+12 1.1541e+15 430946
## + sqft_basement 1 1.1799e+12 1.1541e+15 430946
## + floors 1 3.1999e+11 1.1550e+15 430959
## + bathrooms 1 2.4923e+11 1.1551e+15 430960
## <none> 1.1553e+15 430962
##
## Step: AIC=427294.2
## price ~ sqft_living + good_neigh
##
## Df Sum of Sq RSS AIC
## + waterfront 1 1.0049e+14 8.3390e+14 425329
## + view 1 9.3769e+13 8.4061e+14 425468
## + grade 1 4.8715e+13 8.8567e+14 426370
## + yr_built 1 4.1487e+13 8.9289e+14 426511
## + bedrooms 1 2.6106e+13 9.0828e+14 426806
## + renovated 1 1.4132e+13 9.2025e+14 427033
## + condition 1 1.1442e+13 9.2294e+14 427083
## + sqft_living15 1 8.4179e+12 9.2596e+14 427140
## + bathrooms 1 7.2292e+11 9.3366e+14 427283
## + floors 1 5.1669e+11 9.3386e+14 427287
## + sqft_lot 1 1.6274e+11 9.3422e+14 427293
## <none> 9.3438e+14 427294
## + sqft_lot15 1 9.4700e+09 9.3437e+14 427296
## + sqft_above 1 1.7768e+07 9.3438e+14 427296
## + sqft_basement 1 1.7768e+07 9.3438e+14 427296
##
## Step: AIC=425329
## price ~ sqft_living + good_neigh + waterfront
##
## Df Sum of Sq RSS AIC
## + grade 1 4.6444e+13 7.8745e+14 424340
## + view 1 3.9679e+13 7.9422e+14 424488
## + yr_built 1 3.3563e+13 8.0033e+14 424621
## + bedrooms 1 1.7950e+13 8.1595e+14 424955
## + condition 1 1.0031e+13 8.2386e+14 425122
## + renovated 1 8.3178e+12 8.2558e+14 425158
## + sqft_living15 1 7.8908e+12 8.2600e+14 425167
## + floors 1 3.9438e+11 8.3350e+14 425323
## + bathrooms 1 3.7414e+11 8.3352e+14 425323
## + sqft_lot 1 2.4130e+11 8.3365e+14 425326
## + sqft_above 1 2.3809e+11 8.3366e+14 425326
## + sqft_basement 1 2.3809e+11 8.3366e+14 425326
## <none> 8.3390e+14 425329
## + sqft_lot15 1 3.0893e+10 8.3386e+14 425330
##
## Step: AIC=424340.1
## price ~ sqft_living + good_neigh + waterfront + grade
##
## Df Sum of Sq RSS AIC
## + yr_built 1 7.8000e+13 7.0945e+14 422539
## + view 1 3.4827e+13 7.5262e+14 423560
## + condition 1 1.8697e+13 7.6875e+14 423927
## + bedrooms 1 1.0676e+13 7.7678e+14 424106
## + renovated 1 1.0488e+13 7.7696e+14 424110
## + floors 1 7.7556e+12 7.7970e+14 424171
## + bathrooms 1 4.5855e+12 7.8287e+14 424241
## + sqft_above 1 2.8014e+12 7.8465e+14 424281
## + sqft_basement 1 2.8014e+12 7.8465e+14 424281
## + sqft_living15 1 4.2468e+11 7.8703e+14 424333
## + sqft_lot 1 2.9435e+11 7.8716e+14 424336
## <none> 7.8745e+14 424340
## + sqft_lot15 1 1.4095e+10 7.8744e+14 424342
##
## Step: AIC=422538.6
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built
##
## Df Sum of Sq RSS AIC
## + view 1 1.9437e+13 6.9001e+14 422060
## + bedrooms 1 1.0695e+13 6.9876e+14 422278
## + condition 1 2.3116e+12 7.0714e+14 422484
## + bathrooms 1 1.9274e+12 7.0752e+14 422494
## + renovated 1 1.3410e+12 7.0811e+14 422508
## + sqft_living15 1 9.3984e+11 7.0851e+14 422518
## + floors 1 3.5352e+11 7.0910e+14 422532
## + sqft_lot 1 1.4158e+11 7.0931e+14 422537
## + sqft_above 1 1.1899e+11 7.0933e+14 422538
## + sqft_basement 1 1.1899e+11 7.0933e+14 422538
## <none> 7.0945e+14 422539
## + sqft_lot15 1 2.3719e+10 7.0943e+14 422540
##
## Step: AIC=422060.3
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view
##
## Df Sum of Sq RSS AIC
## + bedrooms 1 8.5389e+12 6.8148e+14 421847
## + condition 1 2.1251e+12 6.8789e+14 422009
## + bathrooms 1 1.7060e+12 6.8831e+14 422020
## + sqft_above 1 1.2459e+12 6.8877e+14 422031
## + sqft_basement 1 1.2459e+12 6.8877e+14 422031
## + renovated 1 1.0939e+12 6.8892e+14 422035
## + floors 1 5.9382e+11 6.8942e+14 422047
## + sqft_living15 1 2.5814e+11 6.8976e+14 422056
## <none> 6.9001e+14 422060
## + sqft_lot 1 7.8955e+10 6.8994e+14 422060
## + sqft_lot15 1 5.4919e+10 6.8996e+14 422061
##
## Step: AIC=421847
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms
##
## Df Sum of Sq RSS AIC
## + bathrooms 1 3.8353e+12 6.7764e+14 421751
## + condition 1 2.7023e+12 6.7877e+14 421780
## + renovated 1 1.0256e+12 6.8045e+14 421823
## + sqft_above 1 9.6664e+11 6.8051e+14 421824
## + sqft_basement 1 9.6664e+11 6.8051e+14 421824
## + floors 1 6.3527e+11 6.8084e+14 421833
## + sqft_lot15 1 3.0405e+11 6.8117e+14 421841
## + sqft_living15 1 1.8797e+11 6.8129e+14 421844
## <none> 6.8148e+14 421847
## + sqft_lot 1 1.7386e+08 6.8148e+14 421849
##
## Step: AIC=421751.4
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms
##
## Df Sum of Sq RSS AIC
## + condition 1 2.5458e+12 6.7509e+14 421688
## + sqft_above 1 1.2960e+12 6.7634e+14 421720
## + sqft_basement 1 1.2960e+12 6.7634e+14 421720
## + renovated 1 5.6770e+11 6.7707e+14 421739
## + sqft_living15 1 4.1997e+11 6.7722e+14 421743
## + sqft_lot15 1 1.8687e+11 6.7745e+14 421749
## + floors 1 1.3107e+11 6.7751e+14 421750
## <none> 6.7764e+14 421751
## + sqft_lot 1 5.1651e+09 6.7764e+14 421753
##
## Step: AIC=421688.3
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition
##
## Df Sum of Sq RSS AIC
## + sqft_above 1 1.7795e+12 6.7332e+14 421645
## + sqft_basement 1 1.7795e+12 6.7332e+14 421645
## + renovated 1 1.0676e+12 6.7403e+14 421663
## + sqft_living15 1 4.5095e+11 6.7464e+14 421679
## + floors 1 3.2529e+11 6.7477e+14 421682
## + sqft_lot15 1 2.0826e+11 6.7489e+14 421685
## <none> 6.7509e+14 421688
## + sqft_lot 1 4.5832e+09 6.7509e+14 421690
##
## Step: AIC=421644.7
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition + sqft_above
##
## Df Sum of Sq RSS AIC
## + renovated 1 9.9938e+11 6.7232e+14 421621
## + sqft_lot15 1 3.0093e+11 6.7301e+14 421639
## + sqft_living15 1 2.2173e+11 6.7309e+14 421641
## <none> 6.7332e+14 421645
## + floors 1 1.2372e+09 6.7331e+14 421647
## + sqft_lot 1 6.7932e+08 6.7331e+14 421647
##
## Step: AIC=421621
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition + sqft_above + renovated
##
## Df Sum of Sq RSS AIC
## + sqft_lot15 1 3.1404e+11 6.7200e+14 421615
## + sqft_living15 1 2.7048e+11 6.7205e+14 421616
## <none> 6.7232e+14 421621
## + sqft_lot 1 9.6491e+08 6.7231e+14 421623
## + floors 1 1.2630e+08 6.7232e+14 421623
##
## Step: AIC=421615
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition + sqft_above + renovated +
## sqft_lot15
##
## Df Sum of Sq RSS AIC
## + sqft_living15 1 3.0923e+11 6.7169e+14 421609
## + sqft_lot 1 2.9174e+11 6.7171e+14 421609
## <none> 6.7200e+14 421615
## + floors 1 4.0939e+09 6.7200e+14 421617
##
## Step: AIC=421609
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition + sqft_above + renovated +
## sqft_lot15 + sqft_living15
##
## Df Sum of Sq RSS AIC
## + sqft_lot 1 3.1649e+11 6.7138e+14 421603
## <none> 6.7169e+14 421609
## + floors 1 8.0363e+08 6.7169e+14 421611
##
## Step: AIC=421602.8
## price ~ sqft_living + good_neigh + waterfront + grade + yr_built +
## view + bedrooms + bathrooms + condition + sqft_above + renovated +
## sqft_lot15 + sqft_living15 + sqft_lot
##
## Df Sum of Sq RSS AIC
## <none> 6.7138e+14 421603
## + floors 1 2424591349 6.7137e+14 421605
##
## Call:
## lm(formula = price ~ sqft_living + good_neigh + waterfront +
## grade + yr_built + view + bedrooms + bathrooms + condition +
## sqft_above + renovated + sqft_lot15 + sqft_living15 + sqft_lot,
## data = train2)
##
## Coefficients:
## (Intercept) sqft_living good_neigh waterfront grade
## -5.535e+05 1.603e+02 1.908e+05 6.392e+05 8.721e+04
## yr_built view bedrooms bathrooms condition
## -4.823e+04 4.781e+04 -3.451e+04 3.222e+04 2.423e+04
## sqft_above renovated sqft_lot15 sqft_living15 sqft_lot
## 2.638e+01 4.198e+04 -3.387e-01 1.067e+01 1.643e-01
step(regfull, scope=list(lower=regnull, upper=regfull), direction="backward")
## Start: AIC=421604.8
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + sqft_above + sqft_basement +
## yr_built + sqft_living15 + sqft_lot15 + good_neigh + renovated
##
##
## Step: AIC=421604.8
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + sqft_above + yr_built +
## sqft_living15 + sqft_lot15 + good_neigh + renovated
##
## Df Sum of Sq RSS AIC
## - floors 1 2.4246e+09 6.7138e+14 421603
## <none> 6.7137e+14 421605
## - sqft_lot 1 3.1811e+11 6.7169e+14 421611
## - sqft_living15 1 3.3412e+11 6.7171e+14 421611
## - sqft_lot15 1 6.6443e+11 6.7204e+14 421620
## - renovated 1 1.0684e+12 6.7244e+14 421630
## - sqft_above 1 1.1720e+12 6.7255e+14 421633
## - bathrooms 1 3.1610e+12 6.7453e+14 421684
## - condition 1 3.6071e+12 6.7498e+14 421695
## - bedrooms 1 1.0929e+13 6.8230e+14 421882
## - view 1 1.6588e+13 6.8796e+14 422025
## - yr_built 1 4.3150e+13 7.1452e+14 422680
## - waterfront 1 4.3530e+13 7.1490e+14 422689
## - sqft_living 1 4.4329e+13 7.1570e+14 422708
## - grade 1 5.3268e+13 7.2464e+14 422923
## - good_neigh 1 1.2264e+14 7.9402e+14 424504
##
## Step: AIC=421602.8
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + waterfront +
## view + condition + grade + sqft_above + yr_built + sqft_living15 +
## sqft_lot15 + good_neigh + renovated
##
## Df Sum of Sq RSS AIC
## <none> 6.7138e+14 421603
## - sqft_lot 1 3.1649e+11 6.7169e+14 421609
## - sqft_living15 1 3.3398e+11 6.7171e+14 421609
## - sqft_lot15 1 6.6766e+11 6.7204e+14 421618
## - renovated 1 1.0742e+12 6.7245e+14 421628
## - sqft_above 1 1.4933e+12 6.7287e+14 421639
## - bathrooms 1 3.4356e+12 6.7481e+14 421689
## - condition 1 3.6152e+12 6.7499e+14 421694
## - bedrooms 1 1.0953e+13 6.8233e+14 421881
## - view 1 1.6649e+13 6.8802e+14 422024
## - waterfront 1 4.3541e+13 7.1492e+14 422687
## - yr_built 1 4.4574e+13 7.1595e+14 422712
## - sqft_living 1 4.7994e+13 7.1937e+14 422795
## - grade 1 5.3784e+13 7.2516e+14 422933
## - good_neigh 1 1.2534e+14 7.9671e+14 424560
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## waterfront + view + condition + grade + sqft_above + yr_built +
## sqft_living15 + sqft_lot15 + good_neigh + renovated, data = train2)
##
## Coefficients:
## (Intercept) bedrooms bathrooms sqft_living sqft_lot
## -5.535e+05 -3.451e+04 3.222e+04 1.603e+02 1.643e-01
## waterfront view condition grade sqft_above
## 6.392e+05 4.781e+04 2.423e+04 8.721e+04 2.638e+01
## yr_built sqft_living15 sqft_lot15 good_neigh renovated
## -4.823e+04 1.067e+01 -3.387e-01 1.908e+05 4.198e+04
All three search procedures select the same 14 predictors, leaving out only floors and sqft_basement.
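The backward search removes sqft_basement before it prints any table. A likely reason is that, in the rows shown earlier, sqft_living equals sqft_above plus sqft_basement, so the full model cannot estimate all three coefficients. The sketch below reproduces that situation on made-up data; `above`, `basement`, `total`, and `price` are hypothetical variables, not columns of train2.

```r
set.seed(1)

# Toy predictors in which total = above + basement holds exactly,
# mimicking sqft_living = sqft_above + sqft_basement
above <- c(1180, 2170, 770, 1050, 1680, 3890, 1200, 900)
basement <- c(0, 400, 0, 910, 0, 1530, 300, 0)
total <- above + basement

# Made-up response with some noise
price <- 100 * total + 20 * above + rnorm(8, sd = 1000)

fit <- lm(price ~ total + above + basement)

coef(fit)   # the coefficient of basement cannot be estimated
alias(fit)  # reports basement as a linear combination of the other terms
```

This also explains why sqft_above and sqft_basement show identical rows in the step tables once sqft_living is in the model.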
result5 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
waterfront + view + condition + grade + sqft_above + yr_built +
sqft_living15 + sqft_lot15 + renovated + good_neigh, data = train2)
summary(result5)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## waterfront + view + condition + grade + sqft_above + yr_built +
## sqft_living15 + sqft_lot15 + renovated + good_neigh, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1144540 -98940 -11414 73986 4394089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.535e+05 1.665e+04 -33.235 < 2e-16 ***
## bedrooms -3.451e+04 2.056e+03 -16.788 < 2e-16 ***
## bathrooms 3.222e+04 3.427e+03 9.402 < 2e-16 ***
## sqft_living 1.603e+02 4.563e+00 35.142 < 2e-16 ***
## sqft_lot 1.643e-01 5.758e-02 2.854 0.00433 **
## waterfront 6.392e+05 1.910e+04 33.471 < 2e-16 ***
## view 4.781e+04 2.310e+03 20.697 < 2e-16 ***
## condition 2.423e+04 2.512e+03 9.645 < 2e-16 ***
## grade 8.721e+04 2.344e+03 37.201 < 2e-16 ***
## sqft_above 2.638e+01 4.256e+00 6.199 5.83e-10 ***
## yr_built -4.823e+04 1.424e+03 -33.866 < 2e-16 ***
## sqft_living15 1.067e+01 3.640e+00 2.931 0.00338 **
## sqft_lot15 -3.387e-01 8.171e-02 -4.145 3.42e-05 ***
## renovated 4.198e+04 7.985e+03 5.257 1.48e-07 ***
## good_neigh 1.908e+05 3.360e+03 56.790 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 197100 on 17275 degrees of freedom
## Multiple R-squared: 0.7066, Adjusted R-squared: 0.7064
## F-statistic: 2972 on 14 and 17275 DF, p-value: < 2.2e-16
summary(result5)$r.squared
## [1] 0.7065922
summary(result5)$adj.r.squared
## [1] 0.7063545
PRESS(result5)
## [1] 6.759204e+14
##Find SST
anova_result<-anova(result5)
SST<-sum(anova_result$"Sum Sq")
##R2 pred
Rsq_pred <- 1-PRESS(result5)/SST
Rsq_pred
## [1] 0.7046062
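The `PRESS()` function used here is assumed to come from a package or a helper loaded earlier. For a linear model the same quantity can be obtained from the leverages, since each leave-one-out residual equals \(e_i / (1 - h_{ii})\). The sketch below checks this shortcut against an explicit leave-one-out loop on the built-in mtcars data; all object names are illustrative.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shortcut: leave-one-out residuals from ordinary residuals and hat values
press_hat <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Same quantity by refitting the model without each row in turn
press_loop <- sum(sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))^2
}))

# Predicted R-squared, defined as 1 - PRESS / SST
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_pred <- 1 - press_hat / sst

c(press_hat = press_hat, press_loop = press_loop, r2_pred = r2_pred)
```

Since PRESS can never be smaller than the residual sum of squares, the predicted \(R^2\) is always at most the ordinary \(R^2\), which matches the 0.7046 versus 0.7066 seen above.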
However, the residual plot shows that the constant-variance assumption is not satisfied.
yhat<-result5$fitted.values
res<-result5$residuals
Data<-data.frame(train2,yhat,res)
ggplot(Data, aes(x=yhat,y=res))+
geom_point()+
geom_hline(yintercept=0, color="red")+
labs(x="Fitted y",
y="Residuals",
title="Residual Plot")
acf(res)
qqnorm(res)
qqline(res, col="red")
To find a suitable \(\lambda\) for transforming y, we look at the Box-Cox plot. It suggests \(\lambda = 0\), so let’s apply a log transformation to price.
boxcox(result5,lambda = seq(-1.,1,0.5))
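Besides reading \(\lambda\) off the plot, the maximizing value can be extracted numerically. This is a minimal sketch on the built-in mtcars data, assuming the `boxcox()` above is the one from the MASS package; `bc` and `lambda_hat` are illustrative names.

```r
library(MASS)

# Model with a strictly positive response, used only for illustration
fit <- lm(mpg ~ wt, data = mtcars)

# Evaluate the profile log-likelihood on a grid without drawing the plot
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# bc$x holds the lambda grid, bc$y the corresponding log-likelihood values
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

In practice a convenient value such as 0 (log), 0.5 (square root), or 1 (no transformation) is chosen when it lies inside the confidence interval shown on the plot.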
All of our summary statistics improve noticeably after adding the good_neigh column and log-transforming price.
train2 <- train2 %>% mutate(price = log(price))
result6 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
waterfront + view + condition + grade + sqft_above + yr_built +
sqft_living15 + sqft_lot15 + renovated + good_neigh, data = train2)
summary(result6)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## waterfront + view + condition + grade + sqft_above + yr_built +
## sqft_living15 + sqft_lot15 + renovated + good_neigh, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83399 -0.15020 -0.00719 0.14830 1.03942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.108e+01 2.077e-02 533.458 < 2e-16 ***
## bedrooms -1.112e-02 2.564e-03 -4.338 1.44e-05 ***
## bathrooms 7.205e-02 4.274e-03 16.860 < 2e-16 ***
## sqft_living 1.405e-04 5.691e-06 24.683 < 2e-16 ***
## sqft_lot 5.457e-07 7.181e-08 7.599 3.14e-14 ***
## waterfront 4.495e-01 2.382e-02 18.870 < 2e-16 ***
## view 5.621e-02 2.881e-03 19.513 < 2e-16 ***
## condition 4.954e-02 3.133e-03 15.811 < 2e-16 ***
## grade 1.416e-01 2.924e-03 48.429 < 2e-16 ***
## sqft_above 2.289e-05 5.309e-06 4.311 1.63e-05 ***
## yr_built -5.801e-02 1.776e-03 -32.661 < 2e-16 ***
## sqft_living15 6.832e-05 4.540e-06 15.048 < 2e-16 ***
## sqft_lot15 -2.088e-08 1.019e-07 -0.205 0.838
## renovated 6.888e-02 9.959e-03 6.916 4.79e-12 ***
## good_neigh 4.428e-01 4.190e-03 105.659 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2459 on 17275 degrees of freedom
## Multiple R-squared: 0.7817, Adjusted R-squared: 0.7815
## F-statistic: 4419 on 14 and 17275 DF, p-value: < 2.2e-16
test2$predict <- round(exp(predict(result6, newdata = test2)))
test_mse_ln_6 <- mean((test2$price - test2$predict)^2)
test_mse_ln_6
## [1] 35394126054
summary(result6)$r.squared
## [1] 0.781717
summary(result6)$adj.r.squared
## [1] 0.7815401
PRESS(result6)
## [1] 1047.15
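# Find SST (total sum of squares) by summing the ANOVA table's Sum Sq column
anova_result <- anova(result6)
SST <- sum(anova_result$"Sum Sq")
# Predicted R-squared: 1 - PRESS / SST
Rsq_pred <- 1 - PRESS(result6) / SST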
Rsq_pred
## [1] 0.7811394
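For readers without a `PRESS()` helper at hand, the statistic can be computed from the leverages: each leave-one-out residual equals the ordinary residual divided by \(1 - h_{ii}\). The sketch below uses the built-in `mtcars` data rather than the housing data, and the names `fit`, `press`, `sst_direct`, `sst_anova`, and `rsq_pred` are hypothetical.

```r
# Small example model on a built-in data set (illustrative only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# PRESS: sum of squared leave-one-out residuals, obtained from the hat values
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# SST computed directly and via the ANOVA table; the two should agree
sst_direct <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
sst_anova <- sum(anova(fit)$"Sum Sq")

# Predicted R-squared
rsq_pred <- 1 - press / sst_direct
c(R2 = summary(fit)$r.squared, R2_pred = rsq_pred)
```

Because PRESS is never smaller than the residual sum of squares, the predicted R-squared cannot exceed the ordinary R-squared; a small gap between the two, as in the output above, suggests the model is not overfitting.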
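# Collect fitted values and residuals from result6 (log-price scale)
yhat <- result6$fitted.values
res <- result6$residuals
Data <- data.frame(train2, yhat, res)
# Residuals vs. fitted values after the log transformation
ggplot(Data, aes(x = yhat, y = res)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Fitted y",
       y = "Residuals",
       title = "Residual Plot")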
acf(res)
qqnorm(res)
qqline(res, col="red")
However, the p-value for the sqft_lot15 predictor is high (0.838), so let’s drop it and refit the model. The refitted model satisfies the regression assumptions and gives pretty good results, so this is our final model.
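\(y^* = 11.08 - 1.110\times10^{-2}\,x_{\mathrm{bedrooms}} + 7.208\times10^{-2}\,x_{\mathrm{bathrooms}} + 1.404\times10^{-4}\,x_{\mathrm{sqft\_living}} + 5.351\times10^{-7}\,x_{\mathrm{sqft\_lot}} + 4.494\times10^{-1}\,x_{\mathrm{waterfront}} + 5.622\times10^{-2}\,x_{\mathrm{view}} + 4.952\times10^{-2}\,x_{\mathrm{condition}} + 1.416\times10^{-1}\,x_{\mathrm{grade}} + 2.287\times10^{-5}\,x_{\mathrm{sqft\_above}} - 5.802\times10^{-2}\,x_{\mathrm{yr\_built}} + 6.825\times10^{-5}\,x_{\mathrm{sqft\_living15}} + 6.885\times10^{-2}\,x_{\mathrm{renovated}} + 4.428\times10^{-1}\,x_{\mathrm{good\_neigh}}\), where \(y^* = \log(y)\)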
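Because the response is log(price), each coefficient acts multiplicatively on price: a one-unit increase in a predictor changes the predicted price by \(100\,(e^{\beta} - 1)\) percent, holding the other predictors fixed. A minimal sketch, with a few coefficients copied by hand from the equation above (the vector name `coefs` and the selection of predictors are only for illustration):

```r
# Coefficients copied from the final model equation (log-price scale)
coefs <- c(waterfront = 4.494e-01,
           good_neigh = 4.428e-01,
           grade      = 1.416e-01,
           bathrooms  = 7.208e-02)

# Percent change in predicted price for a one-unit increase in each predictor
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 1)
```

Read this way, a waterfront home or a home in a good neighborhood is predicted to sell for roughly 55 to 57 percent more than an otherwise identical home, which is a considerably larger effect than one extra bathroom.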
result7 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
waterfront + view + condition + grade + sqft_above + yr_built +
sqft_living15 + renovated + good_neigh, data = train2)
summary(result7)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## waterfront + view + condition + grade + sqft_above + yr_built +
## sqft_living15 + renovated + good_neigh, data = train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83481 -0.15019 -0.00713 0.14838 1.03943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.108e+01 2.076e-02 533.668 < 2e-16 ***
## bedrooms -1.110e-02 2.562e-03 -4.334 1.47e-05 ***
## bathrooms 7.208e-02 4.271e-03 16.879 < 2e-16 ***
## sqft_living 1.404e-04 5.688e-06 24.690 < 2e-16 ***
## sqft_lot 5.351e-07 4.964e-08 10.778 < 2e-16 ***
## waterfront 4.494e-01 2.381e-02 18.870 < 2e-16 ***
## view 5.622e-02 2.881e-03 19.514 < 2e-16 ***
## condition 4.952e-02 3.132e-03 15.811 < 2e-16 ***
## grade 1.416e-01 2.922e-03 48.465 < 2e-16 ***
## sqft_above 2.287e-05 5.308e-06 4.309 1.65e-05 ***
## yr_built -5.802e-02 1.775e-03 -32.681 < 2e-16 ***
## sqft_living15 6.825e-05 4.528e-06 15.072 < 2e-16 ***
## renovated 6.885e-02 9.957e-03 6.914 4.87e-12 ***
## good_neigh 4.428e-01 4.181e-03 105.903 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2459 on 17276 degrees of freedom
## Multiple R-squared: 0.7817, Adjusted R-squared: 0.7816
## F-statistic: 4759 on 13 and 17276 DF, p-value: < 2.2e-16
test2$predict <- round(exp(predict(result7, newdata = test2)))
head(test2)
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 5 510000 3 2.00 1680 8080 1 0 0
## 9 229500 3 1.00 1780 7470 1 0 0
## 10 323000 3 2.50 1890 6560 2 0 0
## 12 468000 2 1.00 1160 6000 1 0 0
## 17 395000 3 2.00 1890 14040 2 0 0
## 22 2000000 3 2.75 3050 44867 1 0 4
## condition grade sqft_above sqft_basement yr_built sqft_living15 sqft_lot15
## 5 3 8 1680 0 4 1800 7503
## 9 3 7 1050 730 3 1780 8113
## 10 3 7 1890 0 5 2390 7570
## 12 4 7 860 300 2 1330 6000
## 17 3 7 1890 0 4 1890 14018
## 22 3 9 2330 720 3 4110 20336
## good_neigh renovated predict
## 5 1 0 481307
## 9 0 0 264003
## 10 0 0 282549
## 12 1 0 409323
## 17 0 0 280257
## 22 1 0 1141010
test_mse_ln_7 <- mean((test2$price - test2$predict)^2)
test_mse_ln_7
## [1] 35383710608
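The test MSE is in squared dollars, which is hard to read. Taking the square root gives an RMSE in dollars, and the two models can be compared directly. The sketch below simply reuses the two MSE values printed in this section; the vector names are hypothetical.

```r
# Test MSEs reported above for the models with and without sqft_lot15
test_mse <- c(result6 = 35394126054, result7 = 35383710608)

# RMSE in dollars: a typical size of the prediction error on the test set
test_rmse <- sqrt(test_mse)
round(test_rmse)

# Relative reduction in test MSE from dropping sqft_lot15, in percent
pct_drop <- 100 * (test_mse[["result6"]] - test_mse[["result7"]]) / test_mse[["result6"]]
pct_drop
```

The reduction is a small fraction of one percent, so dropping sqft_lot15 simplifies the model without any real change in predictive accuracy.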
summary(result7)$r.squared
## [1] 0.7817165
summary(result7)$adj.r.squared
## [1] 0.7815522
PRESS(result7)
## [1] 1046.815
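# Find SST (total sum of squares) by summing the ANOVA table's Sum Sq column
anova_result <- anova(result7)
SST <- sum(anova_result$"Sum Sq")
# Predicted R-squared: 1 - PRESS / SST
Rsq_pred <- 1 - PRESS(result7) / SST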
Rsq_pred
## [1] 0.7812095
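# Collect fitted values and residuals from the final model, result7
yhat <- result7$fitted.values
res <- result7$residuals
Data <- data.frame(train2, yhat, res)
# Residuals vs. fitted values for the final model
ggplot(Data, aes(x = yhat, y = res)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Fitted y",
       y = "Residuals",
       title = "Residual Plot")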
acf(res)
qqnorm(res)
qqline(res, col="red")